MAXIMIZING THE PROBABILITY OF ATTAINING A TARGET PRIOR TO EXTINCTION
Abstract. We present a dynamic programming-based solution to the problem of
maximizing the probability of attaining a target set before hitting a cemetery set for a
discrete-time Markov control process. Under mild hypotheses we establish that there 
exists a deterministic stationary policy that achieves the maximum value of this prob- 
ability. We demonstrate how the maximization of this probability can be computed 
through the maximization of an expected total reward until the first hitting time to ei- 
ther the target or the cemetery set. Martingale characterizations of thrifty, equalizing, 
and optimal policies in the context of our problem are also established. 



§1. Introduction 

There are two basic categories of discrete-time controlled Markov processes that 
deal with random temporal horizons. The first is the well-known optimal stopping 
problem [Dynkin, 1963], in which the random horizon arises from some dynamic opti- 
mization protocol based on the past history of the process. The random 'stopping time' 
thus generated is regarded as a decision variable. This problem arises in, among other 
areas, stochastic analysis, mathematical statistics, mathematical finance, and financial 
engineering; see the comprehensive monograph [Peskir and Shiryaev, 2006] for details 
and further references. The second is relatively less common, and is characterized by 
the fact that the random horizon arises as a result of an endogenous event of the sto- 
chastic process, e.g., the process hitting a particular subset of the state-space, variations 
in the process paths crossing a certain threshold. This problem arises in, among oth- 
ers, optimization of target-level criteria [Bouakiz and Kebir, 1995; Dubins and Savage, 
1976], optimal control of retirement investment funds [Boda et al., 2004], minimiza- 
tion of ruin probabilities in insurance funds [Schmidli, 2008], 'satisfaction of needs' 
problems in economics [Simon, 1957], risk minimizing stopping problems [Ohtsubo, 
2003], attainability problems under stochastic perturbations [Digailova and Kurzhan- 
skii, 2004], and optimal control of Markov control processes up to an exit time [Borkar, 
1991]. 

The problem treated in this article falls under the second category above. In broad 
strokes, we consider a discrete-time Markov control process with Borel state and action 
spaces. We assume that there is a certain target set located inside a safe region, the 
latter being a subset of the state-space. The problem is to maximize the probability of 
attaining the target set before exiting the safe set (or equivalently hitting the cemetery 
set or unsafe region). This 'reach a good set while avoiding a bad set' formulation 
arises in, e.g., air traffic control, where aircraft try to reach their destination while 
avoiding collision with other aircraft or the ground despite uncertain weather conditions.
It also arises in portfolio optimization, where it is desired to reach a target level
of wealth without falling below a certain baseline capital with high probability. Finally, 
it forms the core of the computation of safe sets for hybrid systems where the 'good' 
and the 'bad' sets represent states from which a discrete transition into the unsafe set 
is possible [Gao et al., 2007; Tomlin et al., 2000]. Special cases of this problem have 
been investigated in, e.g., [Prandini and Hu, 2006; Watkins and Lygeros, 2003] in the 
context of air traffic applications, [Abate et al., 2008; Prajna et al., 2007] in the context 
of probabilistic safety, [Boda et al., 2004] in the context of maximizing the probability 
of attaining a preassigned comfort level of retirement investment funds. 

It is clear from the description of our problem in the preceding paragraph that 
there are two random times involved, namely, the hitting times of the target and the 
cemetery sets. In this article we formulate our problem as the maximization of an 
expected total reward accumulated up to the minimum of these two hitting times. As 
such, this formulation falls under the broad framework of optimal control of Markov 
control processes up to an exit time, which has a long and rich history. It has mostly 
been studied as the minimization of an expected total cost until the first time that 
the state enters a given target set, see e.g., [Borkar, 1991, Chapter II], [Hernandez- 
Lerma and Lasserre, 1999, Chapter 8], and the references therein. In particular, if a 
unit cost is incurred as long as the state is outside the target set, then the problem of 
minimizing the cost accumulated until the state enters the target is known variously as 
the pursuit problem [Eaton and Zadeh, 1962], transient programming [Whittle, 1983],
the first passage problem [Derman, 1970; Kushner, 1971], the stochastic shortest path 
problem [Bertsekas, 2007], and control up to an exit time [Borkar, 1988, 1991; Kesten 
and Spitzer, 1975]. Here we exploit certain additional structures of our problem in the
dynamic programming equations that we derive leading to methods fine-tuned to the 
particular problem at hand. 

Our main results center around the assertion that there exists a deterministic sta- 
tionary policy that maximizes the probability of hitting the target set before the ceme- 
tery set. This maximal probability as a function of the initial state is the optimal value 
function for our problem. We obtain a Bellman equation for our problem which is 
solved by the optimal value function. Furthermore, we provide martingale-theoretic 
conditions characterizing 'thrifty', 'equalizing', and optimal policies via methods de- 
rived from [Dubins and Savage, 1976; Karatzas and Sudderth, 2009]; see also [Zhu 
and Guo, 2006] and the references therein for martingale characterization of average 
optimality. The principal techniques employed in this article are similar to the ones 
in [Chatterjee et al., 2008], where the authors studied optimal control of a Markov 
control process up to its first entry time to a safe set. In [Chatterjee et al., 2008] we
developed a recovery strategy to enter a given target set from its exterior while minimizing
a discounted cost. The problem was posed as one of minimizing the sum of a
discounted cost-per-stage function $c$ up to the first entry time $\tau$ to a target set, namely,
minimize $\mathsf{E}^\pi_x\bigl[\sum_{t=0}^{\tau-1} \alpha^t c(x_t, a_t)\bigr]$ over a class of admissible policies $\pi$, where $\alpha \in\, ]0, 1[$
is a discount factor. Here we extend this approach to problems with two sets, a target
and a cemetery, and the case of $\alpha = 1$.

This article unfolds as follows. The main results are stated in §2. In §2.1 we define 
the general setting of the problem, namely, Markov control processes on Polish spaces, 
their transition kernels, and the admissible control strategies. In §2.2 we present our 
main Theorem (2.10) which guarantees the existence of a deterministic stationary 
policy that leads to the maximal probability of hitting the target set while avoiding the 
specified dangerous set, and also provides a Bellman equation that the value function 
must satisfy. In §2.3 we look at a martingale characterization of the optimal control 
problem; thrifty and equalizing policies are defined in the context of our problem, and
we establish necessary and sufficient conditions for optimality in terms of thrifty and 
equalizing policies in Theorem (2.17). We discuss related reward-per-stage functions 
and their relationships to our problem and treat several examples in §3. Proofs of the 
main results appear in §4. The article concludes in §5 with a discussion of future work. 

§2. Main Results 

Our main results are stated in this section after some preliminary definitions and 
conventions. 

§2.1. Preliminaries. We employ the following standard notations. Let $\mathbb{N}$ denote the
natural numbers $\{1, 2, \ldots\}$ and $\mathbb{N}_0$ denote the nonnegative integers $\{0\} \cup \mathbb{N}$. Let $\mathbf{1}_A(\cdot)$
be the usual indicator function of a set $A$, i.e., $\mathbf{1}_A(\xi) = 1$ if $\xi \in A$ and $0$ otherwise. For
real numbers $a$ and $b$ let $a \wedge b := \min\{a, b\}$. A function $f : X \to \mathbb{R}$ restricted to $A \subset X$
is depicted as $f|_A$.

Given a nonempty Borel set $X$ (i.e., a Borel subset of a Polish space), its Borel
$\sigma$-algebra is denoted by $\mathfrak{B}(X)$. By convention, when referring to sets or functions,
"measurable" means "Borel-measurable." If $X$ and $Y$ are nonempty Borel spaces, a
stochastic kernel on $X$ given $Y$ is a function $Q(\cdot \mid \cdot)$ such that $Q(\cdot \mid y)$ is a probability
measure on $X$ for each fixed $y \in Y$, and $Q(B \mid \cdot)$ is a measurable function on $Y$ for each
fixed $B \in \mathfrak{B}(X)$.

We briefly recall some standard definitions below, see, e.g., [Hernandez-Lerma and 
Lasserre, 1996] for further details. A Markov control model is a five-tuple 

(2.1) $(X, A, \{A(x) \mid x \in X\}, Q, r)$

consisting of a nonempty Borel space $X$ called the state-space, a nonempty Borel space $A$
called the control or action set, a family $\{A(x) \mid x \in X\}$ of nonempty measurable subsets
$A(x)$ of $A$, where $A(x)$ denotes the set of feasible controls or actions when the system
is in state $x \in X$ and with the property that the set $\mathbb{K} := \{(x, a) \mid x \in X, a \in A(x)\}$ of
feasible state-action pairs is a measurable subset of $X \times A$, a stochastic kernel $Q$ on $X$
given $\mathbb{K}$ called the transition law, and a measurable function $r : \mathbb{K} \to \mathbb{R}$ called the
reward-per-stage function.

(2.2). Assumption. The set $\mathbb{K}$ of feasible state-action pairs contains the graph of a
measurable function from $X$ to $A$.

Consider the Markov model (2.1), and for each $i = 0, 1, \ldots$, define the space $H_i$ of
admissible histories up to time $i$ as $H_0 := X$ and $H_i := \mathbb{K}^i \times X = \mathbb{K} \times H_{i-1}$, $i \in \mathbb{N}$. A
generic element $h_i$ of $H_i$, which is called an admissible $i$-history or simply $i$-history, is
a vector of the form $h_i = (x_0, a_0, \ldots, x_{i-1}, a_{i-1}, x_i)$, with $(x_j, a_j) \in \mathbb{K}$ for $j = 0, \ldots, i-1$,
and $x_i \in X$. Hereafter we let the $\sigma$-algebra generated by the history $h_i$ be denoted by
$\mathfrak{F}_i$, $i \in \mathbb{N}_0$.

Recall that a policy is a sequence $\pi = (\pi_i)_{i \in \mathbb{N}_0}$ of stochastic kernels $\pi_i$ on the control
set $A$ given $H_i$ satisfying the constraint $\pi_i(A(x_i) \mid h_i) = 1$ for all $h_i \in H_i$, $i \in \mathbb{N}_0$. The set
of all policies is denoted by $\Pi$. Let $(\Omega, \mathfrak{F})$ be the measurable space consisting of the
(canonical) sample space $\Omega := \overline{H}_\infty = (X \times A)^\infty$ and let $\mathfrak{F}$ be the corresponding product
$\sigma$-algebra. The elements of $\Omega$ are sequences of the form $\omega = (x_0, a_0, x_1, a_1, \ldots)$ with
$x_i \in X$ and $a_i \in A$ for all $i \in \mathbb{N}_0$; the projections $x_i$ and $a_i$ from $\Omega$ to the sets $X$ and $A$
are called state and control (or action) variables, respectively.
Let $\pi = (\pi_i)_{i \in \mathbb{N}_0}$ be an arbitrary control policy, and let $\nu$ be an arbitrary probability
measure on $X$, referred to as the initial distribution. By a theorem of Ionescu-Tulcea
[Rao and Swift, 2006, Chapter 3, §4, Theorem 5], there exists a unique probability
measure $\mathsf{P}^\pi_\nu$ on $(\Omega, \mathfrak{F})$ supported on $H_\infty$, such that for all $B \in \mathfrak{B}(X)$, $C \in \mathfrak{B}(A)$,
$h_i \in H_i$, $i \in \mathbb{N}_0$, we have $\mathsf{P}^\pi_\nu(x_0 \in B) = \nu(B)$,

(2.3)a $\mathsf{P}^\pi_\nu(a_i \in C \mid h_i) = \pi_i(C \mid h_i)$,

(2.3)b $\mathsf{P}^\pi_\nu(x_{i+1} \in B \mid h_i, a_i) = Q(B \mid x_i, a_i)$.

(2.4). Definition. The stochastic process $(\Omega, \mathfrak{F}, \mathsf{P}^\pi_\nu, (x_i)_{i \in \mathbb{N}_0})$ is called a discrete-time
Markov control process.

We note that the Markov control process in Definition (2.4) is not necessarily Markovian
in the usual sense due to the dependence on the entire history $h_i$ in (2.3)a;
however, it is well-known [Hernandez-Lerma and Lasserre, 1996, Proposition 2.3.5]
that if $(\pi_i)_{i \in \mathbb{N}_0}$ is restricted to a suitable subclass of policies, then $(x_i)_{i \in \mathbb{N}_0}$ is a Markov
process.

Let $\Phi$ denote the set of stochastic kernels $\varphi$ on $A$ given $X$ such that $\varphi(A(x) \mid x) = 1$
for all $x \in X$, and let $\mathbb{F}$ denote the set of all measurable functions $f : X \to A$ satisfying
$f(x) \in A(x)$ for all $x \in X$. The functions in $\mathbb{F}$ are called measurable selectors of the
set-valued mapping $X \ni x \mapsto A(x) \subset A$. Recall that a policy $\pi = (\pi_i)_{i \in \mathbb{N}_0} \in \Pi$ is said
to be randomized Markov if there exists a sequence $(\varphi_i)_{i \in \mathbb{N}_0}$ of stochastic kernels $\varphi_i \in \Phi$
such that $\pi_i(\cdot \mid h_i) = \varphi_i(\cdot \mid x_i)$ for all $h_i \in H_i$, $i \in \mathbb{N}_0$; deterministic Markov if there exists a
sequence $(f_i)_{i \in \mathbb{N}_0}$ of functions $f_i \in \mathbb{F}$ such that $\pi_i(\cdot \mid h_i) = \delta_{f_i(x_i)}(\cdot)$; deterministic stationary
if there exists a function $f \in \mathbb{F}$ such that $\pi_i(\cdot \mid h_i) = \delta_{f(x_i)}(\cdot)$. As usual let $\Pi$, $\Pi_{RM}$, $\Pi_{DM}$,
and $\Pi_{DS}$ denote the set of all randomized history-dependent, randomized Markov,
deterministic Markov, and deterministic stationary policies, respectively. The transition
kernel $Q$ in (2.3)b under a policy $\pi := (\varphi_i)_{i \in \mathbb{N}_0} \in \Pi_{RM}$ is given by $(Q(\cdot \mid \cdot, \varphi_i))_{i \in \mathbb{N}_0}$,
which is defined as the transition kernel $\mathfrak{B}(X) \times X \ni (B, x) \mapsto Q(B \mid x, \varphi_i(x)) :=
\int_{A(x)} \varphi_i(da \mid x)\, Q(B \mid x, a)$. Occasionally we suppress the dependence of $\varphi_i$ on $x$ and
write $Q(B \mid x, \varphi_i)$ in place of $Q(B \mid x, \varphi_i(x))$, and $r(x_j, \varphi_j) := \int_{A(x_j)} \varphi_j(da \mid x_j)\, r(x_j, a)$.
We simply write $f^\infty$ for a policy $(f, f, \ldots) \in \Pi_{DS}$.
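For a finite model the kernel $Q(\cdot \mid x, \varphi)$ above reduces to a weighted sum over actions. The following is a minimal sketch under that assumption; the toy transition law and the names `Q`, `phi`, `dirac`, and `mix_kernel` are illustrative choices and do not appear in the original development.

```python
# Sketch: Q(B | x, phi) = sum_a phi(a | x) Q(B | x, a) for finite X and A.
# The toy model (states 0..2, actions "stay"/"move") is hypothetical.

def mix_kernel(Q, phi, x):
    """Return the distribution Q(. | x, phi) as a dict {next_state: prob}."""
    mixed = {}
    for a, weight in phi[x].items():          # phi[x][a] = phi(a | x)
        for y, prob in Q[(x, a)].items():     # Q[(x, a)][y] = Q({y} | x, a)
            mixed[y] = mixed.get(y, 0.0) + weight * prob
    return mixed

# Transition law on the feasible state-action pairs.
Q = {
    (0, "stay"): {0: 1.0},
    (0, "move"): {1: 0.7, 2: 0.3},
    (1, "stay"): {1: 1.0},
    (1, "move"): {0: 0.5, 2: 0.5},
    (2, "stay"): {2: 1.0},
    (2, "move"): {0: 1.0},
}

# A randomized Markov decision rule phi(. | x) ...
phi = {x: {"stay": 0.25, "move": 0.75} for x in range(3)}
# ... and a deterministic selector f, viewed as the Dirac kernel delta_{f(x)}.
f = {0: "move", 1: "move", 2: "stay"}
dirac = {x: {f[x]: 1.0} for x in f}

row_random = mix_kernel(Q, phi, 0)     # mixture of the "stay" and "move" rows
row_determ = mix_kernel(Q, dirac, 0)   # reduces to Q[(0, "move")]
```

A deterministic stationary policy is thus the special case in which every $\varphi(\cdot \mid x)$ is a point mass.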

§2.2. Maximizing the Probability of Hitting a Target before a Cemetery Set. Let $O$
and $K$ be two nonempty measurable subsets of $X$ with $O \subsetneq K$. Let

(2.5) $\tau := \inf\{t \in \mathbb{N}_0 \mid x_t \in O\}$ and $\tau' := \inf\{t \in \mathbb{N}_0 \mid x_t \in X \setminus K\}$

be the first hitting times of the above sets.^1 These random times are stopping times
with respect to the filtration $(\mathfrak{F}_n)_{n \in \mathbb{N}_0}$. Suppose that the objective is to maximize the
probability that the state hits the set $O$ before exiting the set $K$; in symbols the objective
is to attain

(2.6) $V^\star(x) := \sup_{\pi \in \Pi} V(\pi, x) := \sup_{\pi \in \Pi} \mathsf{P}^\pi_x(\tau < \tau', \tau < \infty)$,

where the sup is taken over a class $\Pi$ of admissible policies.
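To make the objective (2.6) concrete, the quantity $V(\pi, x)$ for one fixed policy can be approximated by simulation. The sketch below is only an illustration under stated assumptions: the random walk, the helper names `simulate_hit_before_exit` and `estimate_value`, and the truncation parameter `max_steps` are all hypothetical, and paths that are not absorbed within `max_steps` are counted as failures, which can bias the estimate downwards.

```python
import random

def simulate_hit_before_exit(step, policy, x0, target, safe, rng, max_steps=10_000):
    """Run one path; return True if the target is hit before leaving the safe set."""
    x = x0
    for _ in range(max_steps):
        if x in target:        # tau occurs first (the target lies inside the safe set)
            return True
        if x not in safe:      # tau' occurs first: the process is killed
            return False
        x = step(x, policy(x), rng)
    return False               # neither hitting time observed within max_steps

def estimate_value(step, policy, x0, target, safe, n_paths, seed=0):
    """Monte Carlo estimate of V(pi, x0) = P(tau < tau', tau < infinity)."""
    rng = random.Random(seed)
    hits = sum(
        simulate_hit_before_exit(step, policy, x0, target, safe, rng)
        for _ in range(n_paths)
    )
    return hits / n_paths

# Hypothetical example: symmetric walk on {0,...,4}; the action is the step size.
def step(x, a, rng):
    return x + a if rng.random() < 0.5 else x - a

target = {4}               # O
safe = {1, 2, 3, 4}        # K; the cemetery X \ K is {0}
unit_steps = lambda x: 1   # deterministic stationary policy f(x) = 1

estimate = estimate_value(step, unit_steps, 2, target, safe, n_paths=20_000)
```

For this symmetric walk started at 2 the exact probability of reaching 4 before 0 is 1/2, so the estimate should be close to that value.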

(2.7). Admissible policies. It is clear at once that the class of admissible policies for
the problem (2.6) is different from the classes considered in §2.1. Indeed, since the
process is killed at the stopping time $\tau \wedge \tau'$, it follows that the class of admissible
policies should also be truncated at the stage $\tau \wedge \tau' - 1$. For a given stage $t \in \mathbb{N}_0$ we
define the $t$-th policy element $\pi_t$ only on the set $\{t < \tau \wedge \tau'\}$. Note that with this
definition $\pi_t$ becomes a $\mathfrak{F}_{t \wedge \tau \wedge \tau'}$-measurable randomized control. It is also immediate
from the definitions of $\tau$ and $\tau'$ that if the initial condition $x \in O \cup (X \setminus K)$, then the
set of admissible policies is empty in the sense that there is nothing to do by definition.
Indeed, in this case $\tau \wedge \tau' = 0$ and no control is needed. We are thus interested only
in $x \in K \setminus O$, for otherwise the problem is trivial. In other words, the domain of $\pi_t$
is contained in the 'spatial' region $\{(x, a) \in \mathbb{K} \mid x \in K \setminus O, a \in A(x)\}$. Equivalently, in
view of the definitions of the 'temporal' elements $\tau$ and $\tau'$, $\pi_t$ is well-defined on the
set $\{t < \tau \wedge \tau'\}$. We re-define $\mathbb{K} := \{(x, a) \in \mathbb{K} \mid x \in K \setminus O, a \in A(x)\}$, and also let $\mathbb{F}$
be the set of measurable selectors of the set-valued map $K \setminus O \ni x \mapsto A(x) \subset A$.

^1 As usual we set the infimum over an empty set to be $\infty$.



Throughout this subsection we shall denote by $\Pi_M$ the class of Markov policies such
that if $(\pi_t)_{t \in \mathbb{N}_0} \in \Pi_M$, then $\pi_t$ is defined on $K$ for each $t$.



(2.8). Recall that a transition kernel $Q$ on a measurable space $X$ given another measurable
space $Y$ is said to be strongly Feller if the mapping $y \mapsto \int_X g(x)\, Q(dx \mid y)$ is continuous
and bounded for every measurable and bounded function $g : X \to \mathbb{R}$. A function
$g : \mathbb{K} \to \mathbb{R}$ is upper semicontinuous (u.s.c.) if for every sequence $(x_j, a_j)_{j \in \mathbb{N}} \subset \mathbb{K}$
converging to $(x, a) \in \mathbb{K}$, we have $\limsup_{j \to \infty} g(x_j, a_j) \le g(x, a)$; or, equivalently, if
for every $r \in \mathbb{R}$, the set $\{(x, a) \in \mathbb{K} \mid g(x, a) \ge r\}$ is closed in $\mathbb{K}$. A set-valued map
$\Psi : X \twoheadrightarrow Y$ between topological spaces is upper hemicontinuous at a point $x$ if for
every neighborhood $U$ of $\Psi(x)$ there exists a neighborhood $V$ of $x$ such that $z \in V$
implies that $\Psi(z) \subset U$; $\Psi$ is upper hemicontinuous if it is upper hemicontinuous at
every $x$ in its domain. If $X$ is equipped with a $\sigma$-algebra $\Sigma$, then the set-valued map
$\Psi$ is called weakly measurable if $\Psi^\ell(G) \in \Sigma$ for every open $G \subset Y$, where $\Psi^\ell$ is the
lower inverse of $\Psi$, defined by $\Psi^\ell(A) := \{x \in X \mid \Psi(x) \cap A \neq \emptyset\}$. See, e.g., [Aliprantis
and Border, 2006, Chapters 17-18] for further details on set-valued maps.^2 Whenever
$B \subset X$ is a nonempty measurable set and we are concerned with any set-valued map
$B \ni x \mapsto A(x) \subset A$, we let $B$ be equipped with the trace of $\mathfrak{B}(X)$ on $B$. Let $b\mathfrak{B}(X)^+$
denote the convex cone of nonnegative, bounded, and measurable real-valued functions
on $X$, and we define $\mathcal{B} := \{g \in L^\infty(X) \mid g|_{X \setminus K} = 0, \|g\|_{L^\infty(X)} \le 1\}$.

(2.9). Assumption. In addition to Assumption (2.2), we stipulate that

(i) the set-valued map $K \setminus O \ni x \mapsto A(x) \subset A$ is compact-valued, upper hemicontinuous,
and weakly measurable;

(ii) the transition kernel $Q$ on $X$ given $\mathbb{K}$ is strongly Feller, i.e., the mapping $\mathbb{K} \ni
(x, a) \mapsto \int_X Q(dy \mid x, a)\, g(y)$ is continuous and bounded for all bounded and
measurable functions $g : X \to \mathbb{R}$.

The following theorem gives basic existence results for the problem (2.6); a proof 
is presented in §4.1. 

(2.10). Theorem. Suppose that Assumption (2.9) holds, and that $\tau \wedge \tau'$ is finite for
every policy in $\Pi_M$. Then:

(i) The value function $V^\star$ is a pointwise bounded and measurable solution to the
Bellman equation in $\psi$:

(2.11) $\psi(x) = \mathbf{1}_O(x) + \mathbf{1}_{K \setminus O}(x) \max_{a \in A(x)} \int_X Q(dy \mid x, a)\, \mathbf{1}_K(y)\, \psi(y)$ for all $x \in X$.

Moreover, $V^\star$ is minimal in $\mathcal{B} \cap b\mathfrak{B}(X)^+$.

(ii) There exists a measurable selector $f_\star \in \mathbb{F}$ such that $f_\star(x) \in A(x)$ attains the maximum
in (2.11) for each $x \in K \setminus O$, which satisfies

(2.12) $V^\star(x) = \begin{cases} 1 & \text{if } x \in O, \\ \int_K Q(dy \mid x, f_\star)\, V^\star(y) & \text{if } x \in K \setminus O, \\ 0 & \text{otherwise,} \end{cases}$

where $V^\star$ is as defined in (3.1). Moreover, the deterministic stationary policy $f_\star^\infty$
is optimal. Conversely, if $f_\star^\infty$ is optimal, then it satisfies (2.12).

^2 What we call "set-valued maps" are "correspondences" in [Aliprantis and Border, 2006].

(2.13). As a matter of notation we shall henceforth represent the functional equation
(2.12) with the less formal version:

(2.14) $V^\star(x) = \mathbf{1}_O(x) + \mathbf{1}_{K \setminus O}(x) \int_K Q(dy \mid x, f_\star)\, V^\star(y)$ for all $x \in X$.

Note that the measure $Q(\cdot \mid x, f)$ is not well-defined for $x \in O \cup (X \setminus K)$ for $f \in \mathbb{F}$ in
view of the definition in paragraph (2.7). As such the integral $\int_K Q(dy \mid x, f_\star)\, V^\star(y)$ is
undefined for $x \in O \cup (X \setminus K)$. However, to preserve the form of (2.11) and simplify
notation, we shall stick to the representation (2.14) by defining any object that is
written as an integral of a bounded measurable function with respect to the measure
$Q(\cdot \mid x, f)$ to be $0$ whenever $x \in O \cup (X \setminus K)$ and $f \in \mathbb{F}$.
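On a finite model, the relations (2.11), (2.12), and (2.14) can be checked numerically for a candidate selector. The following is a minimal sketch, assuming a hypothetical "red-and-black" game on $\{0, \ldots, N\}$ in which the action is a stake; the model, the constants `N` and `P_WIN`, and the helpers `evaluate` and `attains_max` are illustrative assumptions and are not taken from the text. The check evaluates the selector through the fixed-point form (2.14) and then tests whether it attains the maximum in (2.11).

```python
# Sketch: check whether a candidate selector f satisfies (2.12) and attains
# the maximum in (2.11). Hypothetical red-and-black game with win probability P_WIN.

N, P_WIN = 4, 0.4
TARGET = {N}                          # O
SAFE = set(range(1, N + 1))           # K; the cemetery X \ K is {0}
INTERIOR = sorted(SAFE - TARGET)      # K \ O, where controls are defined

def actions(x):
    """Feasible stakes A(x) for x in K \\ O."""
    return range(1, min(x, N - x) + 1)

def expected_next(x, a, v):
    """Integral over K of Q(dy | x, a) v(y); states outside K contribute zero."""
    win, lose = x + a, x - a
    return P_WIN * v.get(win, 0.0) + (1 - P_WIN) * v.get(lose, 0.0)

def evaluate(f, sweeps=1000):
    """Iterate V <- 1_O + 1_{K\\O} * integral of V under the fixed selector f."""
    v = {x: (1.0 if x in TARGET else 0.0) for x in SAFE}
    for _ in range(sweeps):
        for x in INTERIOR:
            v[x] = expected_next(x, f[x], v)
    return v

def attains_max(f, v, tol=1e-9):
    """True if f(x) maximizes the right-hand side of (2.11) for every x in K \\ O."""
    return all(
        expected_next(x, f[x], v)
        >= max(expected_next(x, a, v) for a in actions(x)) - tol
        for x in INTERIOR
    )

bold = {x: min(x, N - x) for x in INTERIOR}    # largest feasible stake
timid = {x: 1 for x in INTERIOR}               # always stake one unit

v_bold, v_timid = evaluate(bold), evaluate(timid)
```

In this toy game `attains_max(bold, v_bold)` holds while `attains_max(timid, v_timid)` fails, which agrees with the classical observation that bold play is preferable in a subfair game.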

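§2.3. A Martingale Characterization. We now return to the more general class of all
possible policies (not just Markovian), denoted by $\Pi$.

Fix an initial state $x \in X$ and a policy $\pi \in \Pi$. For each $n \in \mathbb{N}$ we define the random
variable $W_n(\pi, x) := \sum_{t=0}^{(n-1) \wedge \tau \wedge \tau'} \mathbf{1}_O(x_t)$. Let us consider the process $(\zeta_n)_{n \in \mathbb{N}_0}$ defined
by

(2.15) $\zeta_0 := V^\star(x)$, $\zeta_n := W_n(\pi, x) + \mathbf{1}_{K \setminus O}(x_{(n-1) \wedge \tau \wedge \tau'})\, (\mathbf{1}_K \cdot V^\star)(x_{n \wedge \tau \wedge \tau'})$, $n \in \mathbb{N}$.

We follow the basic framework of [Karatzas and Sudderth, 2009].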

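(2.16). Definition. A policy $\pi \in \Pi$ is called thrifty at $x \in X$ if $V^\star(x) = \Lambda^\pi(x)$, and
equalizing at $x \in K$ if $\Lambda^\pi(x) = V(\pi, x)$. The action $a_n$, defined on $\{\tau \wedge \tau' > n\}$, is said
to conserve $V^\star$ at $x_n$ if $\mathbf{1}_O(x_n) + \mathbf{1}_{K \setminus O}(x_n) \int_K Q(dy \mid x_n, a_n)\, V^\star(y) = V^\star(x_n)$.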

Connections between equalizing, thrifty, and optimal policies for our problem (2.6) 
are established by the following 

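(2.17). Theorem. A policy $\pi \in \Pi$ is
◦ equalizing at $x \in X$ if and only if
$\lim_{n \to \infty} \mathsf{E}^\pi_x\bigl[\mathbf{1}_{K \setminus O}(x_{(n-1) \wedge \tau \wedge \tau'})\, (\mathbf{1}_K \cdot V^\star)(x_{n \wedge \tau \wedge \tau'})\bigr] = 0$;
◦ optimal at $x \in X$ if and only if $\pi$ is both thrifty and equalizing.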



A connection between thrifty policies, the process $(\zeta_n)_{n \in \mathbb{N}_0}$ defined in (2.15), and
actions conserving the optimal value function $V^\star$ is established by the following

(2.18). Theorem. For a given policy $\pi \in \Pi$ and an initial state $x \in X$ the following are
equivalent:

(i) $\pi$ is thrifty at $x$;
(ii) $(\zeta_n)_{n \in \mathbb{N}_0}$ is a $(\mathfrak{F}_n)_{n \in \mathbb{N}_0}$-martingale under $\mathsf{P}^\pi_x$;
(iii) $\mathsf{P}^\pi_x$-almost everywhere on $\{\tau \wedge \tau' > n\}$ the action $a_n$ conserves $V^\star$.

It is possible to make a connection, relying purely on martingale-theoretic arguments,
between the process $(\zeta_n)_{n \in \mathbb{N}_0}$ and the value function corresponding to an optimal
policy. This is the content of the following theorem, which may be viewed as a
partial converse to Theorem (2.18). 

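(2.19). Theorem. Suppose that either one of the stopping times $\tau$ and $\tau'$ defined
in (2.5) is finite for every policy in $\Pi$. Let $V'$ be a nonnegative measurable function
such that $V'|_O = 1$, $V'|_{X \setminus K} = 0$, and bounded above by $1$ elsewhere. For a policy $\pi \in \Pi$
define the process $(\zeta'_n)_{n \in \mathbb{N}_0}$ as

(2.20) $\zeta'_0 := V'(x)$, $\zeta'_n := W_n(\pi, x) + \mathbf{1}_{K \setminus O}(x_{(n-1) \wedge \tau \wedge \tau'})\, (\mathbf{1}_K \cdot V')(x_{n \wedge \tau \wedge \tau'})$, $n \in \mathbb{N}$,

where $W_n(\pi, x)$ is as in (2.15). If for some policy $\pi^\star \in \Pi$ the process $(\zeta'_n)_{n \in \mathbb{N}_0}$ is a
$(\mathfrak{F}_n)_{n \in \mathbb{N}_0}$-martingale under $\mathsf{P}^{\pi^\star}_x$, then $V'(x) = \mathsf{P}^{\pi^\star}_x(\tau < \tau', \tau < \infty)$.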

Proofs of the above results are presented in §4.2. 
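The notion of an action that conserves $V^\star$ in Definition (2.16) can be made tangible on a finite example. The sketch below is an illustration only: it assumes a hypothetical red-and-black game on $\{0, \ldots, N\}$, approximates $V^\star$ by iterating the Bellman operator of (2.11), and lists, for each state of $K \setminus O$, the stakes that conserve $V^\star$. The names `one_step`, `optimal_value`, and `conserving_actions` are not from the text.

```python
# Sketch: actions that conserve V* in the sense of Definition (2.16), for a
# hypothetical red-and-black game; V* is approximated by iterating (2.11).

N, P_WIN = 4, 0.4
TARGET, SAFE = {N}, set(range(1, N + 1))       # O and K; X \ K = {0}
INTERIOR = sorted(SAFE - TARGET)               # K \ O

def actions(x):
    return list(range(1, min(x, N - x) + 1))   # feasible stakes A(x)

def one_step(x, a, v):
    """1_O(x) + 1_{K\\O}(x) * integral over K of Q(dy | x, a) v(y)."""
    if x in TARGET:
        return 1.0
    if x not in SAFE:
        return 0.0
    return P_WIN * v.get(x + a, 0.0) + (1 - P_WIN) * v.get(x - a, 0.0)

def optimal_value(sweeps=1000):
    """Approximate V* on K by repeated application of the Bellman operator."""
    v = {x: (1.0 if x in TARGET else 0.0) for x in SAFE}
    for _ in range(sweeps):
        v = {x: (1.0 if x in TARGET else max(one_step(x, a, v) for a in actions(x)))
             for x in SAFE}
    return v

def conserving_actions(v, tol=1e-9):
    """For each x in K \\ O, the stakes a with one_step(x, a, V*) = V*(x)."""
    return {x: [a for a in actions(x) if abs(one_step(x, a, v) - v[x]) <= tol]
            for x in INTERIOR}

v_star = optimal_value()
conserving = conserving_actions(v_star)    # the unit stake at x = 2 is excluded
```

By Theorem (2.18), a policy that uses only such actions before $\tau \wedge \tau'$ is thrifty; in this toy game the only genuine choice occurs at the middle state.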



§3. Discussion and Examples 



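Let us look at the stopped process $(x_{t \wedge (n-1) \wedge \tau \wedge \tau'})_{t \in \mathbb{N}_0}$. It is clear that in this case
$V_n(\pi, x) = 1$ whenever $x \in O$ and $V_n(\pi, x) = 0$ whenever $x \in X \setminus K$ for all policies in
$\Pi_M$; otherwise for $x \in K \setminus O$ we have

$V_n(\pi, x) := \mathsf{P}^\pi_x(\tau < \tau', \tau < n)$
$= \mathsf{P}^\pi_x(x_{1 \wedge \tau \wedge \tau'} \in O) + \mathsf{P}^\pi_x(x_{1 \wedge \tau \wedge \tau'} \in K \setminus O,\ x_{2 \wedge \tau \wedge \tau'} \in O)$
$+ \cdots + \mathsf{P}^\pi_x(x_{1 \wedge \tau \wedge \tau'}, \ldots, x_{(n-2) \wedge \tau \wedge \tau'} \in K \setminus O,\ x_{(n-1) \wedge \tau \wedge \tau'} \in O)$.

Since the $k$-th term on the right-hand side is $\mathsf{E}^\pi_x\bigl[\prod_{t=1}^{k-1} \mathbf{1}_{K \setminus O}(x_{t \wedge \tau \wedge \tau'})\, \mathbf{1}_O(x_{k \wedge \tau \wedge \tau'})\bigr]$, it
follows that

$V_n(\pi, x) = \mathsf{E}^\pi_x\Bigl[\sum_{t=1}^{(n-1) \wedge \tau \wedge \tau'} \Bigl(\prod_{i=0}^{t-1} \mathbf{1}_{K \setminus O}(x_{i \wedge \tau \wedge \tau'})\Bigr) \mathbf{1}_O(x_{t \wedge \tau \wedge \tau'})\Bigr] = \mathsf{E}^\pi_x\Bigl[\sum_{t=1}^{(n-1) \wedge \tau \wedge \tau'} \mathbf{1}_O(x_t)\Bigr]$.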



We note that $V_n(\pi, x) = 0$ whenever $x \in O \cup (X \setminus K)$. A policy that maximizes $V_n(\pi, x)$
is defined only on the set $K \setminus O$, and it is left undefined elsewhere. Once the process
exits $K \setminus O$ or the stage reaches $n - 1$, the task of our control policy is over. Such a
deterministic stationary policy (which exists, as demonstrated below) with a measurable
selector $f \in \mathbb{F}$ should be represented as $f^{\tau \wedge \tau'} := (f, f, \ldots, f)$ ($\tau \wedge \tau'$ times) since it is applied only for
the first $\tau \wedge \tau'$ stages; however, for notational brevity we simply write $f^\infty$ hereafter.
Quite clearly, letting $n \to \infty$, the monotone convergence theorem gives

$V(\pi, x) = \lim_{n \to \infty} V_n(\pi, x) = \mathsf{P}^\pi_x(\tau < \tau', \tau < \infty) = \mathsf{E}^\pi_x\Bigl[\sum_{t=1}^{\tau \wedge \tau'} \mathbf{1}_O(x_t)\Bigr]$.
We note that by definition, the random sum inside the expectation on the right-hand
side of the last equality above is the limit of partial (finite) sums, and this ensures that
the term inside the expectation is defined on the event $\{\tau \wedge \tau' < \infty\}$. By definition
note that

(3.1) $V^\star(x) = \sup_{\pi \in \Pi_M} V(\pi, x) = \sup_{\pi \in \Pi_M} \mathsf{E}^\pi_x\Bigl[\sum_{t=1}^{\tau \wedge \tau'} \mathbf{1}_O(x_t)\Bigr]$.

Consider again the value-iteration functions defined by

(3.2) $v_0(x) := \mathbf{1}_O(x)$, $v_n(x) := \mathbf{1}_O(x) + \mathbf{1}_{K \setminus O}(x) \max_{a \in A(x)} \int_X Q(dy \mid x, a)\, \mathbf{1}_K(y)\, v_{n-1}(y)$

for $x \in X$ and $n \in \mathbb{N}$. The function $v_n$ is clearly identifiable with the optimal value
function for the problem of maximizing $\mathsf{P}^\pi_x(\tau < \tau', \tau < n)$ of the process stopped
at the $(n-1)$-th stage, $n \in \mathbb{N}$. To get an intuitive idea, fix a deterministic Markov
policy $\pi' = (f_t)_{t \in \mathbb{N}_0}$ and take the first iterate $v_0$. (Of course the assumption underlying
the notation $(f_t)_{t \in \mathbb{N}_0}$ is that $f_t$ is defined on $\{t < \tau \wedge \tau'\}$.) It is immediately
clear that the reward at the first step is $1$ if and only if $x \in O$ and $0$ otherwise, and
that is precisely $v_0$ irrespective of the policy. For the second iterate note the reward
under the policy $\pi'$ is $\mathbf{1}_O(x) + \mathbf{1}_{K \setminus O}(x)\, Q(O \mid x, f_0(x))$. This is because the reward is
$1$ if $x \in O$ and the process terminates at the first stage, or $x \in K \setminus O$ and the reward
at the second stage is the probability of hitting $O$ at the second stage. Of
course there is no reward if $x \in X \setminus K$. Similarly, for the third iterate the reward
is $\mathbf{1}_O(x) + \mathbf{1}_{K \setminus O}(x) \int_K Q(d\xi_1 \mid x, f_0(x)) \bigl(\mathbf{1}_O(\xi_1) + \mathbf{1}_{K \setminus O}(\xi_1)\, Q(O \mid \xi_1, f_1(\xi_1))\bigr)$. Note that
only those sample paths that stay in $K \setminus O$ at the first step contribute to the reward
at the second stage, only those sample paths that stay in $K \setminus O$ for the first and the
second stages contribute to the reward at the third stage, and so on.
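The recursion (3.2) and the stage-by-stage rewards of a fixed deterministic Markov policy can be computed side by side on a finite model. The following is a minimal sketch, assuming a hypothetical red-and-black game on $\{0, \ldots, N\}$; the helper names `bellman_step` and `markov_policy_reward` and all numerical values are illustrative assumptions rather than part of the original text.

```python
# Sketch of the value iteration (3.2) on a hypothetical red-and-black game.

N, P_WIN = 4, 0.4
STATES = range(N + 1)
TARGET, SAFE = {N}, set(range(1, N + 1))       # O and K; X \ K = {0}

def actions(x):
    return range(1, min(x, N - x) + 1)          # feasible stakes on K \ O

def integral(x, a, v):
    """Sum over y of Q(y | x, a) 1_K(y) v(y)."""
    return sum(p * v[y] for y, p in ((x + a, P_WIN), (x - a, 1 - P_WIN)) if y in SAFE)

def indicator_target():
    return [1.0 if x in TARGET else 0.0 for x in STATES]    # v_0 = 1_O

def bellman_step(v):
    """Compute v_n from v_{n-1} according to (3.2)."""
    out = indicator_target()
    for x in SAFE - TARGET:
        out[x] = max(integral(x, a, v) for a in actions(x))
    return out

def markov_policy_reward(selectors):
    """Reward of a deterministic Markov policy (f_0, ..., f_{n-1}), by backward recursion."""
    v = indicator_target()
    for f in reversed(selectors):       # the last selector is applied first
        nxt = indicator_target()
        for x in SAFE - TARGET:
            nxt[x] = integral(x, f(x), v)
        v = nxt
    return v

iterates = [indicator_target()]
for _ in range(5):
    iterates.append(bellman_step(iterates[-1]))

timid_reward = markov_policy_reward([lambda x: 1] * 5)   # always stake one unit
```

The iterates are nondecreasing in the number of stages, and the reward of any fixed Markov policy over the same number of stages is dominated by the corresponding iterate.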



§3.1. A general setting and various special cases. Our problem (2.6) can be viewed
as a special case of a more general setting. To wit, consider a nonnegative upper
semicontinuous reward-per-stage function $r : \mathbb{K} \to \mathbb{R}_{\ge 0}$ and the problem of maximizing
the total reward up to (and including) the hitting time $\tau \wedge \tau'$, i.e., maximize
$\mathsf{E}^\pi_x\bigl[\sum_{t=0}^{\tau \wedge \tau'} r(x_t, a_t)\bigr]$ over a class of policies. This corresponds to maximization of the
reward until exit from the set $K \setminus O$. The value-iteration functions $(v'_n)_{n \in \mathbb{N}_0}$ corresponding
to this problem can be written down readily: for $x \in X$ and $n \in \mathbb{N}$ let

$v'_0(x) := \sup_{a \in A(x)} r(x, a)\, \mathbf{1}_{O \cup (X \setminus K)}(x)$,
$v'_n(x) := \sup_{a \in A(x)} \Bigl[ r(x, a)\, \mathbf{1}_{O \cup (X \setminus K)}(x) + \mathbf{1}_{K \setminus O}(x) \int_X Q(dy \mid x, a)\, v'_{n-1}(y) \Bigr]$.

Our problem (2.6) corresponds to the case of $r(x, a) = \mathbf{1}_O(x)$. Modulo the additional
technical complications involving integrability of the value-iteration functions at each
stage and the total reward corresponding to initial conditions being well-defined real
numbers, the analysis of this more general problem can be carried out in exactly the
same way as we do below for the problem (2.6). While the above more general problem
treats both the target set $O$ and the cemetery state $X \setminus K$ equally, the bias towards
the target set $O$ is provided in our problem (2.6) by the special structure of the reward
$r(x, a) = \mathbf{1}_O(x)$.

From the general framework it is not difficult to arrive at reward-per-stage functions 
that are meaningful in the context of reachability, avoidance, and safety. For the sake 
of simplicity, till the end of this subsection we suppose that for all initial conditions
and admissible policies $\pi \in \Pi$ the stopping times $\tau$ and $\tau'$ are finite $\mathsf{P}^\pi_x$-almost surely.
With this assumption in place, let us look at some examples:

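◦ Consider a discounted version of our problem (2.6), namely, let

$V^{(1)}(\pi, x) := \mathsf{E}^\pi_x\Bigl[\sum_{t=0}^{\tau \wedge \tau'} \alpha^t\, \mathbf{1}_O(x_t)\Bigr]$,

where $\alpha \in\, ]0, 1[$ is a constant discount factor. From the definitions of $\tau$ and $\tau'$ it
follows that $\sum_{t=0}^{\tau \wedge \tau'} \alpha^t\, \mathbf{1}_O(x_t) = \alpha^\tau\, \mathbf{1}_{\{\tau < \tau'\}}$, and in view of the range of $\alpha$ it follows
that maximization of $V^{(1)}$ over admissible policies leads to small values of $\tau$ on the
set $\{\tau < \tau'\}$ on an average, but it is silent about the values of $\tau$ on $\{\tau > \tau'\}$.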

To get a more quantitative idea of the role that the discount factor $\alpha$ plays, let
$\hat\tau$ be a random variable independent of the Markov control process defined in Definition
(2.4),^3 with distribution function $\mathsf{P}(\hat\tau = n) = (1 - \alpha)\alpha^n$ for all $n \in \mathbb{N}_0$. In a
standard way we construct the product probability measure $\mathsf{P}^\pi_x \otimes \mathsf{P}$ and denote the
expectation with respect to this measure as $\mathsf{E}^{\pi, \hat\tau}_x[\cdot]$. We can write

$V^{(1)}(\pi, x) = \mathsf{E}^{\pi, \hat\tau}_x\Bigl[\sum_{t=0}^{\infty} (1 - \alpha)^{-1}\, \mathbf{1}_{\{\hat\tau = t\}}\, \mathbf{1}_O(x_t)\, \mathbf{1}_{\{t \le \tau \wedge \tau'\}}\Bigr]$.

In view of the definitions of $\tau$ and $\tau'$ we get $V^{(1)}(\pi, x) = (1 - \alpha)^{-1}\, \mathsf{E}^{\pi, \hat\tau}_x\bigl[\mathbf{1}_{\{\hat\tau = \tau,\ \tau < \tau'\}}\bigr]$.
This alternative characterization shows that maximization of $V^{(1)}$ over admissible
policies leads to smaller values of $\tau$ compared to $\tau'$; moreover, the random variable
$\hat\tau$ gives a quantitative idea of how the choice of $\alpha$ determines the outcome since
$\hat\tau$ is a geometric random variable with parameter $(1 - \alpha)$. Choosing a small $\alpha$ implies
smaller $\hat\tau$ with higher probability and may appear to be profitable; however, in
certain problems it is possible that the set $O$ may be reachable at $\hat\tau$ with small probability
and the corresponding event of interest $\{\hat\tau = \tau, \tau < \tau'\}$ may be relatively
small for a given initial condition $x$. Moreover, the factor $(1 - \alpha)^{-1}$ is small for small
values of $\alpha$, and contributes to this phenomenon.

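A second quantitative view of the role of $\alpha$ is offered by the fact that $V^{(1)}(\pi, x) =
\mathsf{E}^{\pi, \hat\tau}_x\bigl[\sum_{t=0}^{\hat\tau \wedge \tau \wedge \tau'} \mathbf{1}_O(x_t)\bigr]$. Indeed, we have

$\mathsf{E}^{\pi, \hat\tau}_x\Bigl[\sum_{t=0}^{\hat\tau \wedge \tau \wedge \tau'} \mathbf{1}_O(x_t)\Bigr] = \mathsf{E}^{\pi, \hat\tau}_x\Bigl[\sum_{t=0}^{\hat\tau} \mathbf{1}_O(x_t)\, \mathbf{1}_{\{t \le \tau \wedge \tau'\}}\Bigr]$
$= \mathsf{E}^\pi_x\Bigl[\sum_{n=0}^{\infty} \alpha^n (1 - \alpha) \sum_{t=0}^{n} \mathbf{1}_O(x_t)\, \mathbf{1}_{\{t \le \tau \wedge \tau'\}}\Bigr]$
$= (1 - \alpha)\, \mathsf{E}^\pi_x\Bigl[\sum_{t=0}^{\infty} \sum_{n=t}^{\infty} \alpha^n\, \mathbf{1}_O(x_t)\, \mathbf{1}_{\{t \le \tau \wedge \tau'\}}\Bigr]$
$= (1 - \alpha)\, \mathsf{E}^\pi_x\Bigl[\sum_{t=0}^{\infty} \frac{\alpha^t}{1 - \alpha}\, \mathbf{1}_O(x_t)\, \mathbf{1}_{\{t \le \tau \wedge \tau'\}}\Bigr]$
$= \mathsf{E}^\pi_x\Bigl[\sum_{t=0}^{\tau \wedge \tau'} \alpha^t\, \mathbf{1}_O(x_t)\Bigr] = V^{(1)}(\pi, x)$.

In this setting we do not have the $(1 - \alpha)^{-1}$ factor outside the expectation as in the
second version of $V^{(1)}$ above, and it demonstrates that maximizing $V^{(1)}(\pi, x)$ over
admissible policies leads to maximizing the probability of the event $\{\tau < \hat\tau \wedge \tau'\}$,
where $\alpha$ controls the values of $\hat\tau$ as before.

^3 The random variable $\hat\tau$ can be defined in a standard way by enlarging the probability space.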
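◦ Consider the reward-per-stage function $r(x, a) = \mathbf{1}_O(x) - \mathbf{1}_{K \setminus O}(x)$. Under an integrability
assumption on $\tau \wedge \tau'$ under all admissible policies, we have

$V^{(2)}(\pi, x) := \mathsf{E}^\pi_x\Bigl[\sum_{t=0}^{\tau \wedge \tau'} \bigl(\mathbf{1}_O(x_t) - \mathbf{1}_{K \setminus O}(x_t)\bigr)\Bigr]$
$= \mathsf{P}^\pi_x(\tau < \tau') - \mathsf{E}^\pi_x\Bigl[\sum_{t=0}^{\tau \wedge \tau'} \mathbf{1}_{K \setminus O}(x_t)\Bigr]$
$= \mathsf{P}^\pi_x(\tau < \tau') - \mathsf{E}^\pi_x[\tau \wedge \tau']$.

Clearly, maximization of $V^{(2)}$ over admissible policies leads to both the maximal
enlargement of the set $\{\tau < \tau'\}$ and minimization of the hitting time $\tau$ on this set.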
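◦ Consider $r(x, a) = \mathbf{1}_O(x) - \mathbf{1}_{X \setminus K}(x)$. This leads to the expected total reward until
escape from $K \setminus O$ as

$V^{(3)}(\pi, x) := \mathsf{E}^\pi_x\Bigl[\sum_{t=0}^{\tau \wedge \tau'} \bigl(\mathbf{1}_O(x_t) - \mathbf{1}_{X \setminus K}(x_t)\bigr)\Bigr] = \mathsf{P}^\pi_x(\tau < \tau') - \mathsf{P}^\pi_x(\tau' < \tau)$.

Since $\mathsf{P}^\pi_x(\tau < \tau') + \mathsf{P}^\pi_x(\tau' < \tau) = 1$, maximization of $V^{(3)}$ over admissible policies
maximizes the probability of the event $\{\tau < \tau'\}$. Thus, maximizing $V^{(3)}(\pi, x)$ over
$\pi \in \Pi$ is a different formulation of the objective of our problem (2.6). The above
analysis also shows that the same objective results if we take the reward-per-stage
function to be $\mathbf{1}_O(x) - \gamma\, \mathbf{1}_{X \setminus K}(x)$ for any $\gamma \ge 0$.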
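◦ Suppose that $\tau \wedge \tau'$ is integrable for all admissible policies and consider the
reward-per-stage $r(x, a) = \mathbf{1}_{K \setminus O}(x)$, with

$V^{(4)}(\pi, x) := \mathsf{E}^\pi_x\Bigl[\sum_{t=0}^{\tau \wedge \tau'} \mathbf{1}_{K \setminus O}(x_t)\Bigr]$.

Maximization of $V^{(4)}$ over admissible policies leads to large values of $\tau \wedge \tau'$ on an
average. This is a form of safety problem: the state stays inside $K \setminus O$ for as long as
possible on an average.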
◦ Suppose that $\tau \wedge \tau'$ is integrable for all admissible policies and consider $r(x, a) =
\gamma\, \mathbf{1}_O(x) - \mathbf{1}_{K \setminus O}(x)$ for $\gamma \ge 1$. Consider

$V^{(5)}(\pi, x) := \mathsf{E}^\pi_x\Bigl[\sum_{t=0}^{\tau \wedge \tau'} \bigl(\gamma\, \mathbf{1}_O(x_t) - \mathbf{1}_{K \setminus O}(x_t)\bigr)\Bigr]$;

we see that $V^{(5)}(\pi, x) = \gamma\, \mathsf{P}^\pi_x(\tau < \tau') - \mathsf{E}^\pi_x[\tau \wedge \tau']$. Maximization of $V^{(5)}$
over admissible policies therefore leads to a balance between maximizing the probability that
the state hits the set $O$ before getting out of $K$ and exiting $K$ quickly. This is because
it is more profitable to exit from $K$ and get a zero reward than incur negative reward
by prolonging the duration of stay in $K \setminus O$. The factor $\gamma$ decides the priorities of
the two alternatives. It is trivially clear that $\gamma = 1$ leads to rapid exit from $K$ if the
initial condition is in $K \setminus O$.
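The discounted criterion $V^{(1)}$ in the first example above admits two numerically distinct computations: the fixed point of a discounted recursion, and a sum over the law of $\tau$ weighted by the survival probabilities $\alpha^t$ of an independent geometric time. The sketch below compares the two for a fixed policy; the uncontrolled symmetric walk, the constants, and the function names are hypothetical choices made for illustration, and the starting point is assumed to lie in $K \setminus O$.

```python
# Sketch: two ways of computing V^(1) = E[alpha^tau 1{tau < tau'}] for an
# uncontrolled walk on {0,...,4} (a hypothetical example with a fixed policy).

ALPHA, P_UP, N = 0.9, 0.5, 4
TARGET, INTERIOR = N, range(1, N)      # O = {N}, K \ O = {1,...,N-1}, X \ K = {0}

def discounted_fixed_point(x0, sweeps=2000):
    """Iterate u <- 1_O + 1_{K\\O} * alpha * Q u and return u(x0)."""
    u = [0.0] * (N + 1)
    u[TARGET] = 1.0
    for _ in range(sweeps):
        for x in INTERIOR:
            u[x] = ALPHA * (P_UP * u[x + 1] + (1 - P_UP) * u[x - 1])
    return u[x0]

def geometric_horizon(x0, horizon=2000):
    """Sum over t of alpha**t * P(tau = t, tau < tau'), for x0 in K \\ O.

    alpha**t equals the probability that a geometric time with
    P(n) = (1 - alpha) * alpha**n is at least t."""
    mass = [0.0] * (N + 1)             # law of x_t restricted to {t < tau ^ tau'}
    mass[x0] = 1.0
    total, survive = 0.0, 1.0          # survive = alpha**t
    for _ in range(horizon):
        new = [0.0] * (N + 1)
        for x in INTERIOR:
            new[x + 1] += P_UP * mass[x]
            new[x - 1] += (1 - P_UP) * mass[x]
        survive *= ALPHA
        total += survive * new[TARGET]     # paths that hit O exactly at this stage
        new[TARGET] = new[0] = 0.0         # the process is stopped at tau ^ tau'
        mass = new
    return total

v_recursion = discounted_fixed_point(2)
v_geometric = geometric_horizon(2)
```

Both numbers approximate the same quantity, which for this walk started at 2 can also be obtained by solving the three linear equations on $K \setminus O$ by hand.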

Not all the reward-per-stage functions mentioned above can be handled in our 
present framework. In particular, we make the crucial assumption that the reward- 
per-stage function is nonnegative, which does not hold in some of the cases above. 
However, under appropriate growth-rate conditions on the reward-per-stage function, 
the nonnegativity assumption can be dispensed with. 
In classical finite or infinite-horizon optimal control problems a translation of the
(fixed) reward-per-stage function would not change the solution to the problem. How- 
ever, translations of the reward-per-stage function in random-horizon problems may 
lead to drastically different policies. We give two examples: 

o Consider the reward-per-stage functions $r'(x,a) = \mathbf{1}_{O}(x) - \mathbf{1}_{X\setminus K}(x)$ and
$r''(x,a) = 2\cdot\mathbf{1}_{O}(x) + \mathbf{1}_{K\setminus O}(x)$; in this case we translate $r'$ on $X$ by $1$, i.e., $r'' = r' + 1$.
On the one hand, maximizing $\mathrm{E}^{\pi}_{x}\bigl[\sum_{t=0}^{\tau\wedge\tau'} r'(x_t, a_t)\bigr]$ yields a policy that maximizes
$\mathrm{P}^{\pi}_{x}(\tau < \tau')$ as we have seen before (this is $V^{(3)}$ above). On the other hand, maximizing
$\mathrm{E}^{\pi}_{x}\bigl[\sum_{t=0}^{\tau\wedge\tau'} r''(x_t, a_t)\bigr]$ yields a policy that tries to keep the state in $K \setminus O$ for as long
as possible, and at each stage accrue a reward of $1$, which is certainly better than
jumping to $O$ and accruing a reward of $2$ at most.
o Consider $r'(x,a) = \mathbf{1}_{O}(x) - \mathbf{1}_{X\setminus K}(x)$ and $r''(x,a) = -\mathbf{1}_{O}(x) - 3\cdot\mathbf{1}_{X\setminus K}(x)$; in this
case we translate $r'$ by $-2$ only on its support $O \cup (X \setminus K)$. We have noted above
that maximizing $\mathrm{E}^{\pi}_{x}\bigl[\sum_{t=0}^{\tau\wedge\tau'} r'(x_t, a_t)\bigr]$ yields a policy that maximizes $\mathrm{P}^{\pi}_{x}(\tau < \tau')$.
However, maximizing $\mathrm{E}^{\pi}_{x}\bigl[\sum_{t=0}^{\tau\wedge\tau'} r''(x_t, a_t)\bigr]$ yields a policy that tries to keep the
state in $K \setminus O$ for the longest possible duration to avoid incurring negative reward.
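
The effect of translating the reward can be seen on a toy chain. The sketch below is purely illustrative and not taken from the article: it assumes a single state in $K \setminus O$ with two hypothetical actions ('go' jumps to $O$ with probability `p_hit` and to $X \setminus K$ otherwise, 'stay' remains in $K \setminus O$), and it truncates the horizon to a finite number of stages so that all sums are finite.

```python
def first_action(r_target, r_transient, r_cemetery, horizon=10, p_hit=0.7):
    """Backward induction at the single transient state of a toy chain.

    Returns the optimal action when `horizon` decisions remain.
    The stage rewards are collected up to and including the hitting time.
    """
    value = r_transient            # no decisions left: only the current stage reward
    best = None
    for _ in range(horizon):
        stay = r_transient + value
        go = r_transient + p_hit * r_target + (1 - p_hit) * r_cemetery
        best = "go" if go >= stay else "stay"
        value = max(stay, go)
    return best


print(first_action(1, 0, -1))      # r'  : +1 on O, -1 outside K
print(first_action(2, 1, 0))       # r'' : r' translated by +1 on all of X
print(first_action(-1, 0, -3))     # r'' : r' translated by -2 on O and outside K
```

Under these assumptions the untranslated reward favours jumping towards the target, whereas both translated rewards favour lingering in $K \setminus O$, in line with the two examples above.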

§ 3.2. Further examples. For one-dimensional stochastic processes initialized some- 
where between two different levels a and b, problems such as calculating the proba- 
bility of hitting the level a before the level b are fairly common, e.g., in random walks, 
Brownian motion, and diffusions, see, e.g., [Levin et al., 2009, Chapters 2-3], [Revuz 
and Yor, 1999]. It is possible to obtain explicit expressions of these probabilities in a 
handful of cases. 

Let us consider a controlled Markov chain $(x_t)_{t\in\mathbb{N}_0}$ with a finite state-space
$X = \{1, 2, \ldots, m\}$ and a transition probability matrix $Q = [q_{ij}(a)]_{m\times m}$, where $a$ is the action
or control variable. Let $O \subset X$, $K \subset X$ be subsets of $X$ with $O \subset K$. Since $X$ is finite,
Assumption (2.9) is satisfied. Consider the problem (2.6) in the context of this Markov
chain $(x_t)_{t\in\mathbb{N}_0}$ initialized at some $i_0 \in K \setminus O$. By Theorem (2.10) the optimal value
function $V^\star$ must satisfy the equation

\[
V^\star(i) = \mathbf{1}_{O}(i) + \mathbf{1}_{K\setminus O}(i) \max_{a\in A(i)} \sum_{j\in K} q_{ij}(a)\, V^\star(j)
\]

for all $i \in X$. If the control actions are finite in number, searching for a maximizer over
an enumerated list of all control actions corresponding to each of the states may be
possible if the state and action spaces are not too large. However, the memory requirement
for storing such enumerated lists clearly increases exponentially with the dimension of 
the state and action spaces if the Markov chain is extracted by a discretization proce- 
dure based on a grid on the state-space of a discrete-time Markov process evolving, for 
example, on a subset of Euclidean space. As an alternative, it is possible to search for a 
maximizer from a parametrized family of functions (vectors) by applying well-known 
suboptimal control strategies [Bertsekas, 2007, Chapter 6], [Bertsekas and Tsitsiklis, 
1996; Powell, 2007]. Note that in the case of an uncontrolled Markov chain the equation
above reduces to $V^\star(i) = \mathbf{1}_{O}(i) + \mathbf{1}_{K\setminus O}(i)\bigl(\sum_{j\in O} q_{ij} + \sum_{j\in K\setminus O} q_{ij} V^\star(j)\bigr)$, and can be
solved as a linear equation on $K \setminus O$ for the vector $V^\star|_{K\setminus O}$. Thus, solving for $V^\star$ yields a
method of calculating the probability of hitting $O$ before hitting $X \setminus K$ in uncontrolled
Markov chains, and can serve as a verification tool [Kwiatkowska et al., 2007].
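
As a minimal sketch of this linear solve, consider the symmetric random walk on $\{0, 1, 2, 3, 4\}$ with $O = \{4\}$ and $K = \{1, 2, 3, 4\}$. The chain, the function name `hitting_probability`, and the use of exact rational arithmetic are our illustrative choices, not part of the article. For this walk the probability of reaching $4$ before $0$ from state $i$ is the classical gambler's-ruin value $i/4$.

```python
from fractions import Fraction


def hitting_probability(q, target, safe):
    """Solve V(i) = sum_{j in O} q[i][j] + sum_{j in K minus O} q[i][j] V(j)."""
    transient = sorted(safe - target)            # the set K minus O
    n = len(transient)
    # Augmented system (I - Q_TT) v = Q_TO 1 restricted to the transient states.
    rows = []
    for i in transient:
        row = [Fraction(int(i == j)) - q[i][j] for j in transient]
        row.append(sum(q[i][j] for j in target))
        rows.append(row)
    # Gauss-Jordan elimination in exact rational arithmetic.
    for c in range(n):
        pivot = next(r for r in range(c, n) if rows[r][c] != 0)
        rows[c], rows[pivot] = rows[pivot], rows[c]
        rows[c] = [v / rows[c][c] for v in rows[c]]
        for r in range(n):
            if r != c and rows[r][c] != 0:
                factor = rows[r][c]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[c])]
    return {i: rows[k][n] for k, i in enumerate(transient)}


m = 5
q = [[Fraction(0)] * m for _ in range(m)]
q[0][0] = q[4][4] = Fraction(1)                  # both end states are absorbing
for i in (1, 2, 3):
    q[i][i - 1] = q[i][i + 1] = Fraction(1, 2)

v = hitting_probability(q, target={4}, safe={1, 2, 3, 4})
print(v)
```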
In certain cases of uncountable state-space Markov chains the policies and value
functions corresponding to maximization of $\mathrm{P}^{\pi}_{x}(\tau < \tau', \tau < n)$ can be explicitly
calculated for small values of $n$. As an illustration, consider a scalar linear controlled
system

\[
x_{t+1} = x_t + a_t + w_t, \qquad x_0 = x, \quad t \in \mathbb{N}_0. \tag{3.3}
\]

Here $x_t \in \mathbb{R}$ is the state of the system at time $t$, $a_t$ is the action or control at time
$t$ taking values in $[-1, 1]$, and $(w_t)_{t\in\mathbb{N}_0}$ is a sequence of independent and identically
distributed (i.i.d.) standard normal random variables treated as noise inputs to the
system. Let us suppose that our target set is $O = \,]-1, 1[$, safe set is $K = [-3, 3]$, and let
us find a greedy policy for our problem, i.e., a policy that maximizes $\mathrm{P}^{\pi}_{x}(\tau < \tau', \tau < 2)$.

The greedy policy tries to maximize
$\mathrm{P}_{x}(x_1 \in \,]-1,1[) = \mathrm{P}_{x}(x + a + w_0 \in \,]-1,1[) = \Phi(1 - x - a) - \Phi(-1 - x - a) =: G(x,a)$,
where $\Phi$ is the cumulative distribution function of the standard normal random variable.
The function $G$ can be expressed in terms of the complementary error function$^{4}$ as

\[
G(x,a) = \tfrac{1}{2}\Bigl(\operatorname{erfc}\bigl(-\tfrac{1}{\sqrt{2}}(1 - x - a)\bigr) - \operatorname{erfc}\bigl(-\tfrac{1}{\sqrt{2}}(-1 - x - a)\bigr)\Bigr),
\]

and $\arg\max_{a\in[-1,1]} G(x,a)$ can be solved in closed form. Indeed,

\[
\frac{\partial G}{\partial a}(x,a) = \frac{1}{\sqrt{2\pi}}\Bigl(e^{-\frac{1}{2}(x+a+1)^2} - e^{-\frac{1}{2}(x+a-1)^2}\Bigr) = 0
\]

gives $a^\star = f_\star(x) = -x$ as the unconstrained optimizer. Since $a \in [-1, 1]$, we have the
constrained maximizer as $f_\star(x) = -\operatorname{sat}(x)$, where $\operatorname{sat}(\cdot)$ is the standard saturation
function.$^{5}$ In other words, we get a bang-bang controller since $x - \operatorname{sat}(x) \neq 0$ on the
interior of $K \setminus O$. It is easy to discern the maximizer from the accompanying figure. The
corresponding maximal probability is found by substituting the above optimizer back into the
dynamic programming equation, and this yields

\[
V^\star(x) = \mathbf{1}_{O}(x) + \tfrac{1}{2}\,\mathbf{1}_{K\setminus O}(x)\Bigl(\operatorname{erf}\bigl(\tfrac{1}{\sqrt{2}}(x - \operatorname{sat}(x) + 1)\bigr) - \operatorname{erf}\bigl(\tfrac{1}{\sqrt{2}}(x - \operatorname{sat}(x) - 1)\bigr)\Bigr).
\]

For $n = 3$ it turns out that we can no longer compute the optimizer corresponding to the
first stage in closed form; the optimizer for the second stage is, of course,
$f_\star(x) = -\operatorname{sat}(x)$ calculated above. It is also evident from the accompanying figure
that even in this simple example there will arise nontrivial issues with nonconvexity
for $n \ge 3$.
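
The closed-form greedy controller can be checked numerically. The following sketch evaluates $G$ on a grid of admissible actions and compares the grid maximizer with $-\operatorname{sat}(x)$ for a few initial conditions in $K \setminus O$; the grid resolution, the helper names, and the sample points are our assumptions.

```python
import math


def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def G(x, a):
    """One-step probability that x + a + w lands in ]-1, 1[."""
    return Phi(1.0 - x - a) - Phi(-1.0 - x - a)


def sat(r):
    """Standard saturation function."""
    return max(-1.0, min(1.0, r))


def grid_argmax(x, steps=2001):
    """Maximizer of a -> G(x, a) over a uniform grid on [-1, 1]."""
    grid = [-1.0 + 2.0 * k / (steps - 1) for k in range(steps)]
    return max(grid, key=lambda a: G(x, a))


def V_star(x):
    """Closed-form one-step value on K minus O."""
    s = x - sat(x)
    root2 = math.sqrt(2.0)
    return 0.5 * (math.erf((s + 1.0) / root2) - math.erf((s - 1.0) / root2))


for x in (-2.5, -1.5, 1.2, 3.0):
    print(x, grid_argmax(x), G(x, -sat(x)), V_star(x))
```

For each sample point the grid maximizer sits at the boundary action $-\operatorname{sat}(x) = \mp 1$, and the value attained there agrees with the closed-form expression.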



§3.3. Uniqueness of optimal policies. So far in our discussion we have not addressed 
the issue of uniqueness of the optimal policy in our problem (2.6). (Theorem (2.10) 
shows that an optimal policy exists, so the uniqueness question is meaningful.) It 
becomes clear from considerations of the geometry of the sets O and K in simple 
examples that the optimal controller $f_\star$ in Theorem (2.10)(ii) is nonunique in general.
For instance, take the linear system (3.3) above with initial condition
$x_0 = 0$, and let $O = \,]-2,-1[\ \cup\ ]1,2[$ and $K = [-3,3]$. Since the noise is symmetric
about the origin, from symmetry considerations it immediately follows that the optimal
controller $f_\star$ is nonunique at the origin. Note that $f_\star$ is, of course, defined on $K \setminus O$.
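
The symmetry argument can be illustrated on the one-step (greedy) version of this example, which is a simplification of the full problem (2.6) and is used here only as a sketch. With the state at the origin, the probability of landing in $]-2,-1[\ \cup\ ]1,2[$ after one step is an even function of the action, so maximizers come in pairs; the grid and tolerance below are our assumptions.

```python
import math


def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def G(x, a):
    """P(x + a + w in ]-2,-1[ U ]1,2[) for a standard normal w."""
    y = x + a
    return (Phi(2.0 - y) - Phi(1.0 - y)) + (Phi(-1.0 - y) - Phi(-2.0 - y))


grid = [k / 1000.0 for k in range(-1000, 1001)]      # admissible actions in [-1, 1]
best = max(G(0.0, a) for a in grid)
maximizers = [a for a in grid if abs(G(0.0, a) - best) < 1e-12]
print(maximizers)
```

On this grid the two boundary actions $a = -1$ and $a = 1$ both attain the maximum, so the greedy selector at the origin is not unique.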



$^{4}$ Recall that the complementary error function is defined as $\operatorname{erfc}(r) := \frac{2}{\sqrt{\pi}} \int_{r}^{\infty} e^{-t^2}\, dt = 1 - \operatorname{erf}(r)$, where
$\operatorname{erf}(\cdot)$ is the standard error function.

$^{5}$ Recall that the standard saturation function is defined as $\operatorname{sat}(r)$ equals $r$ if $|r| < 1$, $1$ if $r \ge 1$ and $-1$
otherwise.
§ 3.4. Relation to a probabilistic safety problem. Let us digress a little and consider
the following probabilistic safety problem: maximize the probability that the
state remains inside a safe set $C \subset X$ for $n$ stages, beginning from an initial condition
$x \in C$. This, as mentioned earlier, is the probabilistic safety problem addressed
in [Abate et al., 2008]. Of course the probability of staying inside $C$ for the first $n$
stages is given by $\mathrm{P}^{\pi}_{x}\bigl(\bigcap_{t=0}^{n-1} \{x_t \in C\}\bigr) = \mathrm{E}^{\pi}_{x}\bigl[\prod_{t=0}^{n-1} \mathbf{1}_{\{x_t \in C\}}\bigr]$. If $\sigma$ is the first exit time
from $C$, then $\mathrm{P}^{\pi}_{x}\bigl(\bigcap_{t=0}^{n-1} \{x_t \in C\}\bigr) = \mathrm{E}^{\pi}_{x}\bigl[\prod_{t=0}^{\sigma\wedge(n-1)} \mathbf{1}_{\{x_t \in C\}}\bigr]$. Therefore, in this particular
problem there is no difference between the maximal values of $\mathrm{E}^{\pi}_{x}\bigl[\prod_{t=0}^{n-1} \mathbf{1}_{\{x_t \in C\}}\bigr]$
or $\mathrm{E}^{\pi}_{x}\bigl[\prod_{t=0}^{\sigma\wedge(n-1)} \mathbf{1}_{\{x_t \in C\}}\bigr]$. However, the policies arising from the two different
maximizations are quite unlike each other. Indeed, whereas the former yields a deterministic
Markov policy [Abate et al., 2008] whose every element is defined on all of $X$,
the stopping time version yields a deterministic Markov policy whose $t$-th element $\pi_t$
is defined on the set $\{t < \sigma \wedge n\}$, just as discussed in paragraph (2.7). On the one
hand note that the reward in the former case is not affected by further application of
the control actions once the state has exited the safe set $C$; the policy resulting from
this formulation, however, dictates that the control actions are carried out until (and
including) the $(n-2)$-th stage nonetheless. On the other hand, the reward in the latter
stopping time version saturates at the stage the state leaves $C$ and future control
actions are not defined.

It is interesting to note that the Bellman equation developed for probabilistic safety
and reachability in [Abate et al., 2008] may be obtained as a special case of (2.11)
in Theorem (2.10) above. This comes as no surprise. The problem of maximizing the
probability of staying inside a (measurable) safe set $C \subset X$ for $N$ steps is given by the
maximization of $\mathrm{E}^{\pi}_{x}\bigl[\prod_{t=0}^{\sigma\wedge(N-1)} \mathbf{1}_{C}(x_t)\bigr]$, where $\sigma$ is the first time to exit $C$, and this
clearly translates to minimizing $\mathrm{P}^{\pi}_{x}(\tau < N)$. In our setting, if we let $K$ be the entire
state-space $X$, $C = X \setminus O$, and $\tau$ the first time to hit the set $O$, then our problem (2.6)
is precisely that in [Abate et al., 2008] with the exception of maximization in place of
minimization. It must be mentioned, however, that the analysis carried out in [Abate
et al., 2008] relies on the approach in [Bertsekas and Shreve, 1978] and is purely
analytical; the strong Feller assumption on the transition kernel in our formulation
plays no role there.
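
The complementarity between staying in $C = X \setminus O$ for $N$ stages and hitting $O$ before time $N$ can be checked on a small chain. The sketch below assumes an uncontrolled three-state chain with $K = X$ (the transition matrix and the function name are hypothetical) and computes both probabilities by backward recursion.

```python
def safety_and_reach(q, target, horizon):
    """Return (P(x_0, ..., x_{N-1} in C), P(tau < N)) per initial state, C = X minus O."""
    m = len(q)
    in_O = [1.0 if i in target else 0.0 for i in range(m)]
    in_C = [1.0 - v for v in in_O]
    stay, reach = in_C[:], in_O[:]               # horizon N = 1: only x_0 matters
    for _ in range(horizon - 1):
        stay = [in_C[i] * sum(q[i][j] * stay[j] for j in range(m)) for i in range(m)]
        reach = [in_O[i] + in_C[i] * sum(q[i][j] * reach[j] for j in range(m))
                 for i in range(m)]
    return stay, reach


q = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.0, 0.0, 1.0]]
stay, reach = safety_and_reach(q, target={2}, horizon=5)
print(stay, reach)
```

For every initial state the two numbers add up to one, so maximizing the safety probability is the same as minimizing the probability of reaching $O$ within the horizon.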

§4. Proofs 

This section collects the proofs of the various results in §2. 

§ 4.1. Proof of Theorem (2.10). We recall a few standard results about set-valued
maps first, followed by a sequence of lemmas before getting to the proof of Theorem (2.10).
The various definitions in paragraphs (2.7), (2.8), and (2.13) will be
employed without further reference. Just as in §2.2, for the purposes of this subsection,
we let $\Pi_M$ denote the set of admissible Markov policies such that $\pi_t$ is defined on
$\mathbb{K}$ whenever $(\pi_t)_{t\in\mathbb{N}_0} \in \Pi_M$.

(4.1). Proposition ([Aliprantis and Border, 2006, Lemma 17.30]). Let $\varphi : X \twoheadrightarrow Y$ be
an upper hemicontinuous set-valued map between topological spaces with nonempty
compact values, and let $f : \operatorname{Graph}(\varphi) \to \mathbb{R}$ be upper semicontinuous.$^{6}$ Define the
function $m : X \to \mathbb{R}$ by $m(x) := \max_{y\in\varphi(x)} f(x,y)$. Then the function $m$ is upper
semicontinuous.



$^{6}$ Recall that $\operatorname{Graph}(\varphi)$ is the set $\{(x, y) \mid x \in X,\ y \in \varphi(x)\} \subset X \times Y$, the graph of the set-valued map $\varphi$.
(4.2). Proposition ([Aliprantis and Border, 2006, Theorem 18.19]). Let $X$ be a separable
metrizable space and $(S, \mathcal{S})$ a measurable space. Let $\varphi : S \twoheadrightarrow X$ be a weakly
measurable correspondence with nonempty compact values, and suppose $f : S \times X \to \mathbb{R}$
is a Carathéodory function.$^{7}$ Let us also define the function $m : S \to \mathbb{R}$ by
$m(s) := \max_{x\in\varphi(s)} f(s,x)$, and the correspondence $\mu : S \twoheadrightarrow X$ of maximizers by
$\mu(s) := \{x \in \varphi(s) \mid f(s,x) = m(s)\}$. Then the argmax correspondence $\mu$ is measurable
and admits a measurable selector.

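(4.3). Definition. For $u \in b\mathfrak{B}(X)^{+} \cap B$ we define the mapping $Tu$ by

\[
X \ni x \mapsto Tu(x) := \mathbf{1}_{O}(x) + \mathbf{1}_{K\setminus O}(x) \sup_{a\in A(x)} \int_{K} Q(dy|x,a)\,u(y) \in \mathbb{R}_{\ge 0}. \tag{4.4}
\]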

The operator T is called the dynamic programming operator corresponding to the prob- 
lem (2.6). 

(4.5). Lemma. Suppose that Assumption (2.9) holds. Then the dynamic programming
operator $T$ defined in (4.4) takes $b\mathfrak{B}(X)^{+} \cap B$ into itself. Moreover, there exists a
measurable selector $f \in F$ such that

\[
Tu(x) = \mathbf{1}_{O}(x) + \mathbf{1}_{K\setminus O}(x) \int_{K} Q(dy|x,f)\,u(y) \qquad \forall\, x \in X. \tag{4.6}
\]

Proof. Fix $u \in b\mathfrak{B}(X)^{+} \cap B$. Since the transition kernel $Q$ is strongly Feller on $\mathbb{K}$, the
mapping

\[
\mathbb{K} \ni (x,a) \mapsto S(x,a) := \int_{K} Q(dy|x,a)\,u(y)
\]

is continuous on $\mathbb{K}$. Also, $S(x,a)$ is bounded whenever $u$ is, a bound of $S$ being the
essential supremum norm of $u$. Therefore, since $A(x)$ is compact for each $x \in X$, the
function $S^\star(x) := \sup_{a\in A(x)} S(x,a)$ is well-defined on $K \setminus O$, i.e., the sup is attained on
$A(x)$ for $x \in K \setminus O$. We also note that since $K \setminus O$ is a measurable set, by Assumption (2.9)

o the correspondence $K \setminus O \ni x \mapsto A(x) \subset A$ is upper hemicontinuous, and since $S$
is continuous on $\mathbb{K}$, the map $K \setminus O \ni x \mapsto S^\star(x) := \max_{a\in A(x)} S(x,a) \in \mathbb{R}_{\ge 0}$ is an
u.s.c. function by Proposition (4.1);

o the correspondence $K \setminus O \ni x \mapsto A(x) \subset A$ is weakly measurable, and since $S$ is
continuous on $\mathbb{K}$ (and therefore is a Carathéodory function), there exists a measurable
selector $f \in F$ such that $S^\star(x) = S(x, f(x))$ for all $x \in K \setminus O$ by Proposition (4.2).

It follows at once that $X \ni x \mapsto Tu(x) = \mathbf{1}_{O}(x) + \mathbf{1}_{K\setminus O}(x) \int_{K} Q(dy|x,f(x))\,u(y) \in \mathbb{R}_{\ge 0}$
is a member of the set $b\mathfrak{B}(X)^{+}$, and the assertion follows. □

(4.7). Lemma. Suppose that the hypotheses of Theorem (2.10) hold. If $u \in b\mathfrak{B}(X)^{+} \cap B$
satisfies the inequality $u \le Tu$ pointwise on $X$, then also $u \le V^\star$ pointwise on $X$, where
$T$ is the dynamic programming operator in (4.4).



Proof. By definition of $T$ it is clear that we only need to examine the validity of the
assertion on $K \setminus O$. Suppose that $u \in b\mathfrak{B}(X)^{+} \cap B$ satisfies the inequality $u \le Tu$



$^{7}$ Recall that a Carathéodory function $f : S \times X \to Y$ is a mapping that is measurable in the first
variable and continuous in the second, where $(S, \mathcal{S})$ is a measurable space and $X$, $Y$ are topological spaces.
In particular, if $X$ is a separable and metrizable space, and $Y$ is a metrizable space, every Carathéodory
function $f : S \times X \to Y$ is jointly measurable [Aliprantis and Border, 2006, Lemma 4.51]; this is clearly
true in the Carathéodory functions we consider.
pointwise on $X$. By Lemma (4.5) we know that there exists $f \in F$ satisfying

\[
Tu(x) = \mathbf{1}_{O}(x) + \mathbf{1}_{K\setminus O}(x) \int_{K} Q(dy|x,f)\,u(y) \qquad \forall\, x \in K \setminus O.
\]

A straightforward calculation shows that if $u \le Tu$ then $Tu \le T \circ Tu$ on $K \setminus O$. Fix
$x \in K \setminus O$. Applying the inequality $u \le Tu$ repeatedly we have



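\[
\begin{aligned}
u(x) &\le \mathbf{1}_{O}(x) + \mathbf{1}_{K\setminus O}(x) \int_{K} Q(d\xi_1|x,f)\,u(\xi_1) \\
&\le \mathbf{1}_{O}(x) + \mathbf{1}_{K\setminus O}(x) \int_{K} Q(d\xi_1|x,f) \Bigl(\mathbf{1}_{O}(\xi_1) + \mathbf{1}_{K\setminus O}(\xi_1) \int_{K} Q(d\xi_2|\xi_1,f)\,u(\xi_2)\Bigr),
\end{aligned}
\]

and after $n$ steps

\[
\begin{aligned}
u(x) &\le \mathbf{1}_{O}(x) + \mathbf{1}_{K\setminus O}(x) \int_{K} Q(d\xi_1|x,f) \Bigl(\mathbf{1}_{O}(\xi_1) + \cdots + \mathbf{1}_{K\setminus O}(\xi_{n-2}) \int_{K} Q(d\xi_{n-1}|\xi_{n-2},f) \\
&\qquad\qquad \Bigl(\mathbf{1}_{O}(\xi_{n-1}) + \mathbf{1}_{K\setminus O}(\xi_{n-1}) \int_{K} Q(d\xi_n|\xi_{n-1},f)\,u(\xi_n)\Bigr) \cdots \Bigr) \\
&= \mathbf{1}_{O}(x) + \mathbf{1}_{K\setminus O}(x) \int_{K} Q(d\xi_1|x,f) \Bigl(\mathbf{1}_{O}(\xi_1) + \cdots + \mathbf{1}_{K\setminus O}(\xi_{n-2}) \int_{K} Q(d\xi_{n-1}|\xi_{n-2},f)\,\mathbf{1}_{O}(\xi_{n-1}) \cdots \Bigr) \\
&\quad + \mathbf{1}_{K\setminus O}(x) \int_{K\setminus O} Q(d\xi_1|x,f) \int_{K\setminus O} Q(d\xi_2|\xi_1,f) \cdots \int_{K\setminus O} Q(d\xi_{n-1}|\xi_{n-2},f) \int_{K} Q(d\xi_n|\xi_{n-1},f)\,u(\xi_n).
\end{aligned}
\]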
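We claim that the right-hand side of the last equality above is

\[
\mathrm{E}^{f^\infty}_{x}\Biggl[\sum_{t=0}^{(n-1)\wedge\tau\wedge\tau'} \mathbf{1}_{O}(x_t)\Biggr] + \mathrm{E}^{f^\infty}_{x}\Bigl[\mathbf{1}_{K\setminus O}(x_{(n-1)\wedge\tau\wedge\tau'})\,(\mathbf{1}_{K}\cdot u)(x_{n\wedge\tau\wedge\tau'})\,\mathbf{1}_{\{\tau\wedge\tau'<\infty\}}\Bigr],
\]

where $(\mathbf{1}_{K}\cdot u)(\xi) := \mathbf{1}_{K}(\xi)\,u(\xi)$ for $\xi \in X$. To see this note that the first term is clear
by definition. The second term above is due to the fact that only those trajectories
that stay in $K \setminus O$ for $n$ steps (i.e., from stage $0$ through stage $n-1$) contribute to
the integrand that features $u$, and this accounts for the factor $\mathbf{1}_{K\setminus O}(x_{(n-1)\wedge\tau\wedge\tau'})$. Since
$\{\tau\wedge\tau' < \infty\}$ is a full measure set, the factor $\mathbf{1}_{\{\tau\wedge\tau'<\infty\}}$ does not change the value of the
integral. Taking the limit of the first term above as $n \to \infty$, the monotone convergence
theorem gives

\[
\lim_{n\to\infty} \mathrm{E}^{f^\infty}_{x}\Biggl[\sum_{t=0}^{(n-1)\wedge\tau\wedge\tau'} \mathbf{1}_{O}(x_t)\Biggr] = \mathrm{E}^{f^\infty}_{x}\Biggl[\sum_{t=0}^{\tau\wedge\tau'} \mathbf{1}_{O}(x_t)\Biggr] \le V^\star(x),
\]

where the last inequality follows from the definition of $V^\star$. Since $u$ is bounded and
nonnegative, taking the limit of the second term above as $n \to \infty$, the dominated
convergence theorem gives

\[
\begin{aligned}
\lim_{n\to\infty} \mathrm{E}^{f^\infty}_{x}\Bigl[\mathbf{1}_{K\setminus O}(x_{(n-1)\wedge\tau\wedge\tau'})\,(\mathbf{1}_{K}\cdot u)(x_{n\wedge\tau\wedge\tau'})\,\mathbf{1}_{\{\tau\wedge\tau'<\infty\}}\Bigr]
&= \mathrm{E}^{f^\infty}_{x}\Bigl[\mathbf{1}_{K\setminus O}(x_{\tau\wedge\tau'})\,(\mathbf{1}_{K}\cdot u)(x_{\tau\wedge\tau'})\,\mathbf{1}_{\{\tau\wedge\tau'<\infty\}}\Bigr] \\
&= 0
\end{aligned}
\]

since $\mathbf{1}_{K\setminus O}(x_{\tau\wedge\tau'}) = 0$ on the set $\{\tau\wedge\tau' < \infty\}$ by definition of the stopping times $\tau$
and $\tau'$. Substituting back we see that $u(x) \le V^\star(x)$, and the assertion follows since
$x \in K \setminus O$ is arbitrary. □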

(4.8). Lemma. Suppose that Assumption (2.9) holds. Then the value iteration functions
$(v_n)_{n\in\mathbb{N}_0}$ defined in (3.2) satisfy $v_n \uparrow V^\star$, and the function $V^\star$ satisfies the Bellman
equation (2.11).
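
For a finite chain the monotone convergence asserted here can be observed directly. The sketch below is an illustration under our own assumptions: a hypothetical four-state chain with two actions, a finite-state analogue of the operator $T$ in (4.4), and the initialization $v_0 = \mathbf{1}_{O}$ (the article's definition (3.2) is not reproduced here).

```python
def bellman(u, kernels, target, safe):
    """Finite-state analogue of T: 1 on O, 0 off K, max over actions on K minus O."""
    out = []
    for i in range(len(u)):
        if i in target:
            out.append(1.0)
        elif i in safe:
            out.append(max(sum(k[i][j] * u[j] for j in safe) for k in kernels))
        else:
            out.append(0.0)
    return out


# Hypothetical chain: 0 = cemetery, 1 and 2 = K minus O, 3 = target O.
cautious = [[1, 0, 0, 0], [0.1, 0.6, 0.3, 0], [0, 0.3, 0.5, 0.2], [0, 0, 0, 1]]
bold = [[1, 0, 0, 0], [0.5, 0, 0.5, 0], [0.4, 0, 0, 0.6], [0, 0, 0, 1]]
kernels = [cautious, bold]
target, safe = {3}, {1, 2, 3}

v = [1.0 if i in target else 0.0 for i in range(4)]    # assumed initialization v_0
history = [v]
for _ in range(500):
    v = bellman(v, kernels, target, safe)
    history.append(v)

residual = max(abs(a - b) for a, b in zip(bellman(v, kernels, target, safe), v))
print(v, residual)
```

On this chain the iterates increase to the fixed point $(0, 6/11, 8/11, 1)$, which can be verified by hand from the Bellman equation with the cautious action at both transient states.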



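Proof. From the definition of the value-iteration functions $(v_n)_{n\in\mathbb{N}_0}$ in (3.2) we see that
$(v_n)_{n\in\mathbb{N}_0}$ is a monotone increasing sequence bounded above by $\mathbf{1}_{X}$. Therefore there
exists a measurable function $v^\star : X \to [0,1]$ such that $v_n \uparrow v^\star$ pointwise on $X$. By
definition of $v_n$ we have

\[
\mathrm{E}^{\pi}_{x}\Biggl[\sum_{t=0}^{(n-1)\wedge\tau\wedge\tau'} \mathbf{1}_{O}(x_t)\Biggr] \le \sup_{\pi\in\Pi_M} \mathrm{E}^{\pi}_{x}\Biggl[\sum_{t=0}^{(n-1)\wedge\tau\wedge\tau'} \mathbf{1}_{O}(x_t)\Biggr] = v_n(x),
\]

and the monotone convergence theorem shows that

\[
v^\star(x) = \lim_{n\to\infty} v_n(x) \ge \lim_{n\to\infty} \mathrm{E}^{\pi}_{x}\Biggl[\sum_{t=0}^{(n-1)\wedge\tau\wedge\tau'} \mathbf{1}_{O}(x_t)\Biggr] = \mathrm{E}^{\pi}_{x}\Biggl[\sum_{t=0}^{\tau\wedge\tau'} \mathbf{1}_{O}(x_t)\Biggr].
\]

Taking the supremum over $\pi \in \Pi_M$ on the right-hand side shows that $v^\star \ge V^\star$ pointwise
on $X$. Note that $v_n|_{O} = 1$ and $v_n|_{X\setminus K} = 0$ for all $n$; therefore $v^\star|_{O} = 1$ and
$v^\star|_{X\setminus K} = 0$.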


Let us define the maps

\[
\begin{aligned}
\mathbb{K} \ni (x,a) &\mapsto T'v_n(x,a) := \int_{K} Q(dy|x,a)\,v_n(y) \in [0,1], \\
\mathbb{K} \ni (x,a) &\mapsto T'v^\star(x,a) := \int_{K} Q(dy|x,a)\,v^\star(y) \in [0,1].
\end{aligned}
\]

We note that the transition kernel $Q$ is strongly Feller by Assumption (2.9), and therefore
$T'v_n$, $n \in \mathbb{N}_0$, and $T'v^\star$ are continuous functions on $\mathbb{K}$. Moreover, for all $n \in \mathbb{N}_0$ we
define

\[
\begin{aligned}
T'v_n(x,a) &= T'v^\star(x,a) = 1 \quad \text{for } x \in O \text{ and } a \in A(x), \\
T'v_n(x,a) &= T'v^\star(x,a) = 0 \quad \text{for } x \in X \setminus K \text{ and } a \in A(x).
\end{aligned} \tag{4.9}
\]

Since $v_n \uparrow v^\star$ pointwise on $X$, it follows from the definitions above and the monotone
convergence theorem that for all $x \in X$ and $a \in A(x)$

\[
T'v_n(x,a)\,\mathbf{1}_{K\setminus O}(x) \uparrow T'v^\star(x,a)\,\mathbf{1}_{K\setminus O}(x). \tag{4.10}
\]

Fix $x \in K \setminus O$. Since $T'v_n$ and $T'v^\star$ are continuous functions on $\mathbb{K}$, for each $n \in \mathbb{N}_0$
both $\sup_{a\in A(x)} T'v_n(x,a)$ and $\sup_{a\in A(x)} T'v^\star(x,a)$ are attained on $A(x)$. From the
definition of $(v_n)_{n\in\mathbb{N}_0}$ in (3.2) we have $\max_{a\in A(x)} T'v_n(x,a) \le \max_{a\in A(x)} T'v^\star(x,a)$ for
all $n \in \mathbb{N}_0$. Also, $\bigl(\max_{a\in A(x)} T'v_n(x,a)\bigr)_{n\in\mathbb{N}_0}$ is a nondecreasing sequence of numbers
bounded above by $1$, and therefore it attains a limit. If this limit is strictly
less than $\max_{a\in A(x)} T'v^\star(x,a)$, standard arguments may be invoked to show that
the sequence of continuous functions $\bigl(T'v_n(x,\cdot)\bigr)_{n\in\mathbb{N}_0}$ cannot converge pointwise to
$T'v^\star(x,\cdot)$ on $A(x)$, which contradicts (4.10). It follows that whenever $x \in K \setminus O$,

\[
\begin{aligned}
v^\star(x) &= \lim_{n\to\infty} v_n(x) = \lim_{n\to\infty} Tv_{n-1}(x) \\
&= \lim_{n\to\infty} \max_{a\in A(x)} T'v_{n-1}(x,a) = \max_{a\in A(x)} T'v^\star(x,a) \\
&= Tv^\star(x).
\end{aligned}
\]
Together with (4.9) this shows that $v^\star$ satisfies the Bellman equation (2.11) pointwise
on $X$, i.e., $v^\star = Tv^\star$. We have already seen above that $v^\star \ge V^\star$ pointwise on $X$. Since
$v^\star = Tv^\star$, the reverse inequality follows from Lemma (4.7). Therefore, we conclude
that $v^\star = V^\star$ identically on $X$. □

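(4.11). Lemma. Let $f^\infty$ be a deterministic stationary policy. Then we have

\[
V(f^\infty, x) =
\begin{cases}
1 & \text{if } x \in O, \\
\displaystyle\int_{K} Q(dy|x,f)\,V(f^\infty, y) & \text{if } x \in K \setminus O, \\
0 & \text{otherwise.}
\end{cases} \tag{4.12}
\]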



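Proof. For $x \in O \cup (X \setminus K)$ the assertions are trivial. Fix $x \in K \setminus O$. From the definition
of $V(f^\infty, x)$ we have

\[
\begin{aligned}
V(f^\infty, x) &= \mathrm{E}^{f^\infty}_{x}\Biggl[\sum_{t=0}^{\tau\wedge\tau'} \mathbf{1}_{O}(x_t)\Biggr] \\
&= \mathrm{E}^{f^\infty}\Biggl[\mathbf{1}_{O}(x_0)\,\mathbf{1}_{\{\tau\wedge\tau'=0\}} + \mathbf{1}_{\{\tau\wedge\tau'>0\}} \sum_{t=1}^{\tau\wedge\tau'} \mathbf{1}_{O}(x_t) \,\Bigg|\, x_0 = x\Biggr] \\
&= \mathbf{1}_{O}(x) + \mathrm{E}^{f^\infty}\Biggl[\mathbf{1}_{\{\tau\wedge\tau'>0\}} \sum_{t=1}^{\tau\wedge\tau'} \mathbf{1}_{O}(x_t) \,\Bigg|\, x_0 = x\Biggr].
\end{aligned}
\]

Since $\{\tau\wedge\tau' > 0\} = \{x_0 \in K \setminus O\}$ and this event is $\mathfrak{F}_0$-measurable,

\[
\mathrm{E}^{f^\infty}\Biggl[\mathbf{1}_{\{\tau\wedge\tau'>0\}} \sum_{t=1}^{\tau\wedge\tau'} \mathbf{1}_{O}(x_t) \,\Bigg|\, x_0 = x\Biggr] = \mathbf{1}_{K\setminus O}(x)\, \mathrm{E}^{f^\infty}\Biggl[\sum_{t=1}^{\tau\wedge\tau'} \mathbf{1}_{O}(x_t) \,\Bigg|\, x_0 = x\Biggr].
\]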





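Therefore,

\[
\begin{aligned}
V(f^\infty, x) &= \mathbf{1}_{O}(x) + \mathbf{1}_{K\setminus O}(x)\, \mathrm{E}^{f^\infty}\Biggl[\sum_{t=1}^{\tau\wedge\tau'} \mathbf{1}_{O}(x_t) \,\Bigg|\, x_0 = x\Biggr] \\
&= \mathbf{1}_{O}(x) + \mathbf{1}_{K\setminus O}(x)\, \mathrm{E}^{f^\infty}\Biggl[\sum_{t=1}^{\tau} \mathbf{1}_{O}(x_{t\wedge\tau\wedge\tau'}) \,\Bigg|\, x_0 = x\Biggr].
\end{aligned}
\]

Considering the fact that $V(f^\infty, x) = 0$ for $x \in X \setminus K$ by definition, the Markov property
shows that the second term on the right-hand side above equals

\[
\mathbf{1}_{K\setminus O}(x) \int_{X} Q(dy|x,f)\, \mathrm{E}^{f^\infty}\Biggl[\sum_{t=1}^{\tau} \mathbf{1}_{O}(x_{t\wedge\tau\wedge\tau'}) \,\Bigg|\, x_1 = y\Biggr]
= \mathbf{1}_{K\setminus O}(x) \int_{K} Q(dy|x,f)\, V(f^\infty, y).
\]

Collecting the above equations we obtain (4.12), and this completes the proof. □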

We are now ready for the proof of the first main result. 

Proof of Theorem (2.10). (i) Note that by definition $V^\star$ is nonnegative. The fact that $V^\star$
satisfies the Bellman equation follows from Lemma (4.8). In view of the definition of
$B$ in Theorem (2.10) and Lemma (4.8) we conclude that $V^\star$ is minimal in $b\mathfrak{B}(X)^{+} \cap B$
because $u = Tu$ pointwise on $K \setminus O$ implies that $u \ge V^\star$ pointwise on $K \setminus O$ for any
$u \in b\mathfrak{B}(X)^{+} \cap B$.

(ii) Lemma (4.5) guarantees the existence of a selector $f_\star \in F$ such that (2.12) holds.
Iterating the equality (2.12) (or (2.14)) it follows as in the proof of Lemma (4.7) that
for $x \in X$,

\[
V^\star(x) = \mathrm{E}^{f_\star^\infty}_{x}\Biggl[\sum_{t=0}^{(n-1)\wedge\tau\wedge\tau'} \mathbf{1}_{O}(x_t)\Biggr] + \mathrm{E}^{f_\star^\infty}_{x}\Bigl[\mathbf{1}_{K\setminus O}(x_{(n-1)\wedge\tau\wedge\tau'})\,(\mathbf{1}_{K}\cdot V^\star)(x_{n\wedge\tau\wedge\tau'})\Bigr].
\]

Taking limits as $n \to \infty$ on the right, the monotone and dominated convergence theorems
give $V^\star(x) = V(f_\star^\infty, x)$. Since $x$ is arbitrary, $V^\star(\cdot) = V(f_\star^\infty, \cdot)$ on $K \setminus O$, and
therefore $f_\star^\infty$ is an optimal policy. Conversely, by Lemma (4.11) it follows that under the
stationary deterministic strategy $f_\star^\infty$ we have (4.12) with $f_\star$ in place of $f$, which is
identical to (2.12). □



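§ 4.2. Proofs of the results in §2.3. For the purposes of this subsection we let $\Pi$
denote the set of admissible policies such that $\pi_t$ is defined on $\mathbb{K}$ whenever
$(\pi_t)_{t\in\mathbb{N}_0} \in \Pi$.

(4.13). Lemma. For every policy $\pi \in \Pi$ and initial state $x \in X$ the processes $(\zeta_n)_{n\in\mathbb{N}_0}$
and $\bigl(\mathbf{1}_{K\setminus O}(x_{(n-1)\wedge\tau\wedge\tau'})\,(\mathbf{1}_{K}\cdot V^\star)(x_{n\wedge\tau\wedge\tau'})\bigr)_{n\in\mathbb{N}_0}$ are both nonnegative
$(\mathfrak{F}_n)_{n\in\mathbb{N}_0}$-supermartingales under $\mathrm{P}^{\pi}_{x}$.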



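Proof. It is clear that both processes are nonnegative and $(\mathfrak{F}_n)_{n\in\mathbb{N}_0}$-adapted. Fix $n \in \mathbb{N}_0$,
an initial state $x \in X$, a policy $\pi \in \Pi$, and on the event $\{\tau\wedge\tau' > n\}$ fix a history
$h_n = (x, a_0, x_1, a_1, \ldots, x_{n-1}, a_{n-1}, x_n)$. Let $a_n := \pi_n(h_n)$ on $\{\tau\wedge\tau' > n\}$. Then

\[
\begin{aligned}
\zeta_{n+1} &= W_{n+1}(\pi,x) + \mathbf{1}_{K\setminus O}(x_{n\wedge\tau\wedge\tau'})\,(\mathbf{1}_{K}\cdot V^\star)(x_{(n+1)\wedge\tau\wedge\tau'}) \\
&= W_n(\pi,x) + \mathbf{1}_{O}(x_{n\wedge\tau\wedge\tau'})\,\mathbf{1}_{\{\tau\wedge\tau'=n\}} + \mathbf{1}_{K\setminus O}(x_{n\wedge\tau\wedge\tau'})\,(\mathbf{1}_{K}\cdot V^\star)(x_{(n+1)\wedge\tau\wedge\tau'}) \\
&= W_n(\pi,x) + \mathbf{1}_{\{\tau\wedge\tau'=n\}}\,\mathbf{1}_{O}(x_{n\wedge\tau\wedge\tau'}) + \mathbf{1}_{\{\tau\wedge\tau'>n\}}\,(\mathbf{1}_{K}\cdot V^\star)(x_{(n+1)\wedge\tau\wedge\tau'}).
\end{aligned}
\]

Since on $\{\tau\wedge\tau' \ge n\}$ we have $x_{n\wedge\tau\wedge\tau'} \in O$ only if $\tau\wedge\tau' = n$, it follows that

\[
\begin{aligned}
&\mathbf{1}_{\{\tau\wedge\tau'=n\}}\,\mathbf{1}_{O}(x_{n\wedge\tau\wedge\tau'}) + \mathbf{1}_{\{\tau\wedge\tau'>n\}}\,(\mathbf{1}_{K}\cdot V^\star)(x_{(n+1)\wedge\tau\wedge\tau'}) \\
&\qquad = \mathbf{1}_{\{\tau\wedge\tau'\ge n\}} \Bigl(\mathbf{1}_{O}(x_{n\wedge\tau\wedge\tau'}) + \mathbf{1}_{K\setminus O}(x_{n\wedge\tau\wedge\tau'})\,(\mathbf{1}_{K}\cdot V^\star)(x_{(n+1)\wedge\tau\wedge\tau'})\Bigr).
\end{aligned}
\]

Since $\{\tau\wedge\tau' \ge n\} = \{\tau\wedge\tau' > n-1\} = \{x_{(n-1)\wedge\tau\wedge\tau'} \in K \setminus O\}$, it follows that

\[
\zeta_{n+1} = W_n(\pi,x) + \mathbf{1}_{K\setminus O}(x_{(n-1)\wedge\tau\wedge\tau'}) \cdot \Bigl(\mathbf{1}_{O}(x_{n\wedge\tau\wedge\tau'}) + \mathbf{1}_{K\setminus O}(x_{n\wedge\tau\wedge\tau'})\,(\mathbf{1}_{K}\cdot V^\star)(x_{(n+1)\wedge\tau\wedge\tau'})\Bigr).
\]

Therefore, keeping in mind the definition of $a_n$ above,

\[
\begin{aligned}
\mathrm{E}^{\pi}_{x}\bigl[\zeta_{n+1} \,\big|\, \mathfrak{F}_{n\wedge\tau\wedge\tau'}\bigr] &= W_n(\pi,x) + \mathbf{1}_{K\setminus O}(x_{(n-1)\wedge\tau\wedge\tau'})\, T'V^\star(x_{n\wedge\tau\wedge\tau'}, a_n) \\
&\le W_n(\pi,x) + \mathbf{1}_{K\setminus O}(x_{(n-1)\wedge\tau\wedge\tau'})\, V^\star(x_{n\wedge\tau\wedge\tau'}) \\
&= \zeta_n,
\end{aligned} \tag{4.14}
\]

where the inequality holds $\mathrm{P}^{\pi}_{x}$-almost surely. Therefore, the process $(\zeta_n)_{n\in\mathbb{N}_0}$ is a
nonnegative $(\mathfrak{F}_{n\wedge\tau\wedge\tau'})_{n\in\mathbb{N}_0}$-supermartingale, and hence also a $(\mathfrak{F}_n)_{n\in\mathbb{N}_0}$-supermartingale.
Considering that the sequence $(W_n(\pi,x))_{n\in\mathbb{N}_0}$ is nondecreasing, from the definitions
in (2.15) and the fact that the process $(\zeta_n)_{n\in\mathbb{N}_0}$ is a $(\mathfrak{F}_n)_{n\in\mathbb{N}_0}$-supermartingale we see
that the process $\bigl(\mathbf{1}_{K\setminus O}(x_{(n-1)\wedge\tau\wedge\tau'})\,(\mathbf{1}_{K}\cdot V^\star)(x_{n\wedge\tau\wedge\tau'})\bigr)_{n\in\mathbb{N}_0}$ is also a
$(\mathfrak{F}_n)_{n\in\mathbb{N}_0}$-supermartingale under $\mathrm{P}^{\pi}_{x}$. □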
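Proof of Theorem (2.17). Lemma (4.13) confirms that both of the two adapted processes
$(\zeta_n)_{n\in\mathbb{N}_0}$ and $\bigl(\mathbf{1}_{K\setminus O}(x_{(n-1)\wedge\tau\wedge\tau'})\,(\mathbf{1}_{K}\cdot V^\star)(x_{n\wedge\tau\wedge\tau'})\bigr)_{n\in\mathbb{N}_0}$ converge almost surely and
are nonincreasing in expectation, both under $\mathrm{P}^{\pi}_{x}$. Let $\Lambda^{\pi}(x) := \lim_{n\to\infty} \mathrm{E}^{\pi}_{x}[\zeta_n]$. We
then have

\[
\begin{aligned}
V^\star(x) = \mathrm{E}^{\pi}_{x}[\zeta_0] &\ge \lim_{n\to\infty} \mathrm{E}^{\pi}_{x}[\zeta_n] \\
&= \lim_{n\to\infty} \Bigl(\mathrm{E}^{\pi}_{x}[W_n(\pi,x)] + \mathrm{E}^{\pi}_{x}\bigl[\mathbf{1}_{K\setminus O}(x_{(n-1)\wedge\tau\wedge\tau'})\,(\mathbf{1}_{K}\cdot V^\star)(x_{n\wedge\tau\wedge\tau'})\bigr]\Bigr) \\
&\ge V(\pi,x).
\end{aligned} \tag{4.15}
\]

The assertion is now an immediate consequence of (4.15). □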

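Proof of Theorem (2.18). Suppose that (i) holds. Since $\mathrm{E}^{\pi}_{x}[\zeta_n]$ is nonincreasing with $n$
it follows that $\mathrm{E}^{\pi}_{x}[\zeta_{n+1}] = \mathrm{E}^{\pi}_{x}[\zeta_n] = \cdots = \mathrm{E}^{\pi}_{x}[\zeta_0] = V^\star(x)$ for every $n \in \mathbb{N}_0$. Therefore,
equality must hold $\mathrm{P}^{\pi}_{x}$-almost surely in (4.14), and (ii) follows.

Suppose that (ii) holds. Then equality holds in (4.14) almost surely under $\mathrm{P}^{\pi}_{x}$, and
therefore $\mathrm{P}^{\pi}_{x}$-almost everywhere on the set $\{x_{n\wedge\tau\wedge\tau'} \in K \setminus O\} = \{\tau\wedge\tau' > n\}$ we have
$T'V^\star(x_n, a_n) = V^\star(x_n)$, and (iii) follows.

Suppose that (iii) holds. Then taking expectations in (4.14) we arrive at $\mathrm{E}^{\pi}_{x}[\zeta_{n+1}] =
\mathrm{E}^{\pi}_{x}[\zeta_n] = \cdots = \mathrm{E}^{\pi}_{x}[\zeta_0] = V^\star(x)$. As a result we have $\Lambda^{\pi}(x) = V^\star(x)$, and (i) follows.

□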
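
On a finite chain the inequality in (4.14) reduces to $T'V^\star(x,a) \le V^\star(x)$ for every admissible action, with equality exactly at the actions a policy may use if equality is to hold in (4.14). The sketch below checks this on a hypothetical four-state, two-action chain; the chain, the helper names, and the tolerance are our assumptions, and $V^\star$ is approximated by iterating a finite-state analogue of $T$.

```python
# Hypothetical chain: 0 = cemetery, 1 and 2 = K minus O, 3 = target O.
cautious = [[1, 0, 0, 0], [0.1, 0.6, 0.3, 0], [0, 0.3, 0.5, 0.2], [0, 0, 0, 1]]
bold = [[1, 0, 0, 0], [0.5, 0, 0.5, 0], [0.4, 0, 0, 0.6], [0, 0, 0, 1]]
kernels = [cautious, bold]
target, safe = {3}, {1, 2, 3}
transient = sorted(safe - target)


def one_step(u, k, i):
    """Finite-state analogue of T'u(i, a) on K minus O for the kernel k of action a."""
    return sum(k[i][j] * u[j] for j in safe)


# Approximate V* by value iteration started from the indicator of O.
V = [1.0 if i in target else 0.0 for i in range(4)]
for _ in range(500):
    V = [1.0 if i in target
         else max(one_step(V, k, i) for k in kernels) if i in safe
         else 0.0
         for i in range(4)]

# Supermartingale inequality: no action can increase the conditional expectation.
gaps = {(i, a): V[i] - one_step(V, k, i)
        for i in transient for a, k in enumerate(kernels)}
equality_actions = {i: [a for a in range(len(kernels)) if abs(gaps[(i, a)]) < 1e-9]
                    for i in transient}
print(gaps, equality_actions)
```

Here every gap is nonnegative, and only the cautious action (index 0) attains equality at both transient states, so on this chain the bold action strictly decreases the conditional expectation wherever it is used.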



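Proof of Theorem (2.19). It follows readily from the definition of the stopping times $\tau$
and $\tau'$ that the process $(\zeta'_n)_{n\in\mathbb{N}_0}$ defined in (2.20) is a bounded process, and by assumption
it is a $(\mathfrak{F}_n)_{n\in\mathbb{N}_0}$-martingale under $\mathrm{P}^{\pi^\star}_{x}$. Doob's Optional Sampling Theorem [Rao
and Swift, 2006, Theorem 2, p. 422] applied to $(\zeta'_n)_{n\in\mathbb{N}_0}$ at the stopping time $\tau\wedge\tau'$
gives us

\[
\mathrm{E}^{\pi^\star}_{x}\bigl[\zeta'_{\tau\wedge\tau'}\bigr] = \mathrm{E}^{\pi^\star}_{x}\bigl[\zeta'_0\bigr] = V'(x),
\]

where the last equality follows from the definition of $\zeta'$. From (2.15) we get

\[
\begin{aligned}
\mathrm{E}^{\pi^\star}_{x}\bigl[\zeta'_{\tau\wedge\tau'}\bigr] &= \mathrm{E}^{\pi^\star}_{x}\Bigl[W_{\tau\wedge\tau'}(\pi^\star,x) + \mathbf{1}_{K\setminus O}(x_{\tau\wedge\tau'-1})\,(\mathbf{1}_{K}\cdot V')(x_{\tau\wedge\tau'})\Bigr] \\
&= \mathrm{E}^{\pi^\star}_{x}\Biggl[\sum_{t=0}^{\tau\wedge\tau'-1} \mathbf{1}_{O}(x_t) + \mathbf{1}_{K\setminus O}(x_{\tau\wedge\tau'-1})\,(\mathbf{1}_{K}\cdot V')(x_{\tau\wedge\tau'})\Biggr] \\
&= \mathrm{E}^{\pi^\star}_{x}\Bigl[\mathbf{1}_{K\setminus O}(x_{\tau\wedge\tau'-1})\,(\mathbf{1}_{K}\cdot V')(x_{\tau\wedge\tau'})\Bigr].
\end{aligned}
\]

By definition of $\tau$ and $\tau'$, $\mathbf{1}_{K\setminus O}(x_{\tau\wedge\tau'-1})$ equals $1$ on $\{\tau\wedge\tau' < \infty\}$, and by our hypotheses
the set $\{\tau\wedge\tau' < \infty\}$ is a $\mathrm{P}^{\pi^\star}_{x}$-full-measure set. Continuing from the last equality
above we arrive at

\[
\begin{aligned}
\mathrm{E}^{\pi^\star}_{x}\bigl[\zeta'_{\tau\wedge\tau'}\bigr] &= \mathrm{E}^{\pi^\star}_{x}\Bigl[\mathbf{1}_{\{\tau\wedge\tau'<\infty\}}\,(\mathbf{1}_{K}\cdot V')(x_{\tau\wedge\tau'})\Bigr] \\
&= \mathrm{E}^{\pi^\star}_{x}\Bigl[\mathbf{1}_{\{\tau\wedge\tau'<\infty\}}\bigl(\mathbf{1}_{\{\tau<\tau'\}}\,\mathbf{1}_{K}(x_{\tau})\,V'(x_{\tau}) + \mathbf{1}_{\{\tau'<\tau\}}\,\mathbf{1}_{K}(x_{\tau'})\,V'(x_{\tau'})\bigr)\Bigr] \\
&= \mathrm{E}^{\pi^\star}_{x}\Bigl[\mathbf{1}_{\{\tau\wedge\tau'<\infty\}}\,\mathbf{1}_{\{\tau<\tau'\}}\Bigr] \qquad (4.16) \\
&= \mathrm{P}^{\pi^\star}_{x}(\tau < \tau', \tau < \infty),
\end{aligned}
\]

where the equality in (4.16) follows from the assumptions on $V'$ and the definitions
of $\tau$ and $\tau'$. Collecting the equations above we get $V'(x) = \mathrm{P}^{\pi^\star}_{x}(\tau < \tau', \tau < \infty)$ as
asserted. □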



It is of interest to note that the hypotheses of Theorem (2.19) require at least one of
the stopping times $\tau$ or $\tau'$ to be finite. Let us examine the case of $\tau\wedge\tau'$ being $\infty$ on a set
of positive probability. Following the proof of Theorem (2.19), we see that in this case
we have to agree on the value of $V'(x_{\tau\wedge\tau'})$ on $\{\tau\wedge\tau' = \infty\}$. If $\lim_{n\to\infty} V_n(\pi^\star, x)$ exists,
then we can always let $V'(x_{\tau\wedge\tau'})$ take this value on the set $\{\tau\wedge\tau' = \infty\}$. However,
the context of the problem offers another alternative, namely, to set $V'(x_{\tau\wedge\tau'}) = 0$ on
$\{\tau\wedge\tau' = \infty\}$. This is because if $x_t \in K \setminus O$ for all $t \in \mathbb{N}_0$, then the value of $x_{\tau\wedge\tau'}$ is of
no consequence at all.

§5. Conclusions and Future Work 

The purpose of this article was to present a dynamic programming based solution 
to the problem of maximizing the probability of attaining a target set before hitting 
a cemetery set, and furnish an alternative martingale characterization of optimality 
in terms of thrifty and equalizing policies. Several related problems of interest were 
sketched in §3.1. Some of these problems do not admit an immediate solution in
the dynamic programming framework we established here because of our central
assumption that the reward-per-stage function is nonnegative. This issue deserves further
investigation.

The results in this article also clearly indicate the possibility of developing
verification tools for probabilistic computation tree logic [Kwiatkowska et al.,
2007] in terms of dynamic programming operators. This matter is under investiga- 
tion and will be reported in [Ramponi et al., 2009]. Implementation of the dynamic- 
programming algorithm in this article is challenging due to integration over subsets 
of the state-space, and suboptimal policies are needed. In this context development 
of a possible connection with 'greedy-time-optimal' policies [Meyn, 2008, Chapters 4, 
7], originally proposed as a tractable alternative to optimal policies in demand-driven 
large-scale production systems, is being sought. 